class: title-slide, center, bottom # 3 - The _Tidyverse_ ## Data Science with R · Summer 2021 ### Uli Niemann · Knowledge Management & Discovery Lab #### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/) .courtesy[📷 Photo courtesy of Ulrich Arendt] --- name: motivation ## Why is (base) `R` hard to learn? `R` has some idiosyncrasies that make it hard for learners who are used to other programming languages, e.g.: .pull-left60[ - unhelpful help `?print` - too many functions `colnames()`, `names()` - inconsistent names `read.csv()`, `load()`, `readRDS()` - clumsy console output `print(iris)` - high flexibility (_is this bad?_) - too many ways to select variables: `df$x`, `df$"x"`, `df[,"x"]`, or `df[[1]]` ] .pull-right40[ <img src="figures//03-unhelpful_help.png" width="100%" /> ❓ _"invisibly"?_ ❓ _"generic function"?_ ❓ _"class"?_ ] .footnote[ See a more comprehensive list: Robert A. Muenchen. ["Why R is Hard to Learn"](http://r4stats.com/articles/why-r-is-hard-to-learn/). r4stats.com. Accessed 19.07.2018. ] --- ## Base `R` vs. tidyverse code .pull-left[ 🤔 _"What does this base `R` code?"_ ```r aggregate(iris[, "Sepal.Length"], list(Species = iris[, "Species"]), mean) ``` ``` ## Species x ## 1 setosa 5.006 ## 2 versicolor 5.936 ## 3 virginica 6.588 ``` {{content}} ] -- ❓ _"What are the square brackets [ for?"_ ⟶ They are used for subsetting a data frame. {{content}} -- ❓ _"What is list()?"_ ⟶ It's a type of R object. {{content}} -- ❓ _"Why is the mean() function seemingly applied without argument?"_ ⟶ It is being passed to the subsets of the data frame. {{content}} -- ❓ _"In the second line, why does the first "Species" does not need to be quoted, but the second "Species" does?"_ ... -- .pull-right[ 😎 **tidyverse** equivalent: ```r library(dplyr) group_by(iris, Species) %>% summarize(avg_sl = mean(Sepal.Length)) ``` ``` ## # A tibble: 3 x 2 ## Species avg_sl ## <fct> <dbl> ## 1 setosa 5.01 ## 2 versicolor 5.94 ## 3 virginica 6.59 ``` ] --- name: tidyverse-packages ## The Tidyverse .pull-left[ <img src="figures//03-tidyverse.png" width="100%" /> ] .pull-right[ Quote from the [Tidyverse website](https://www.tidyverse.org/): .content-box-gray[ .font110[ "**R packages for data science.** The tidyverse is an **opinionated collection of R packages designed for data science**. All packages share an underlying **design philosophy, grammar, and data structures**." ] ] → collection of open-source `R` packages mainly for data wrangling and visualization → shared conventions and common APIs across all Tidyverse packages ] --- ## Installation ```r # Install all Tidyverse packages install.packages("tidyverse") # Attach core packages library(tidyverse) ``` The meta-package [`tidyverse`](https://www.tidyverse.org/packages/) contains 26 packages. When running `library(tidyverse)`, only the **core** tidyverse packages become available in your current `R` session. The core packages are: .pull-left60[ **Core-Packages**: - `ggplot2`: creation of graphics - `dplyr`: data wrangling - `tidyr`: data reshaping - `readr`: import of flat data files, e.g. csv - `tibble`: enhanced data frames - `stringr`: string manipulation - `forcats`: factor manipulation - `purrr`: functions for working with list columns ] .pull-right40[ <img src="figures//03-tidyverse_overview.png" width="100%" /> .footnote[ .content-box-purple[ Figure source: Joseph Rickert. ["What is the tidyverse?"](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/). R Views. Accessed 13.07.2019. ] ] ] --- name: dplyr class: center, middle, inverse <img src="figures//03-dplyr.png" width="250px" /> --- ## Data wrangling with `dplyr` 🛠️ .pull-left70[ 🤔 _"What can I do with `dplyr`?"_ - get an overview of a tibble with `glimpse()` - select a subset of columns with `select()` - filter a subset of rows with `filter()` - add new or change existing columns with `mutate()` - pick a subset of rows with `slice()` - reorder rows with `arrange()` - group rows by a grouping column with `group_by()` - calculate a summary (per group) with `summarize()` - join two distinct tibbles by a common column with `*join()` - ... (and more) **Consistent API design:** - first argument of each of these **verbs** is a data frame <!-- - subsequent arguments reference columns in _non-standard evaluation_ notation (no quotation marks) --> - the output is (usually) also a data frame ] .pull-right30[ <img src="figures//03-dplyr.png" width="100%" /> ] ??? - tibble: "enhanced" data frame --- ## Case study: customer bookings data .pull-left60[ .content-box-gray[ .font90[ The company behind a travel price aggregator website wants to analyze its booking data to optimize the website's usability and thus improve their customers' travel experience. The data is organized into two files: - `bookings.csv`: hotel bookings - `properties.csv`: hotel facilities ] ] ] .pull-right40[ <img src="figures//03-expedia.png" width="100%" /> ] .font90[ ```r library(tidyverse) ``` ```r bookings <- read_csv("../data/bookings.csv") ``` ``` ## ## -- Column specification ------------------------------------------------------------------ ## cols( ## booker_id = col_character(), ## property_id = col_double(), ## room_nights = col_double(), ## price_per_night = col_double(), ## checkin_day = col_character(), ## for_business = col_logical(), ## status = col_character(), ## review_score = col_double() ## ) ``` ] .font80[ [Data source](https://github.com/drsimonj/tidyverse_tutorial-useR2018/tree/master/data). ] --- ## Get an overview wit `glimpse()` ```r glimpse(bookings) ``` ``` ## Rows: 10,000 ## Columns: 8 ## $ booker_id <chr> "215934017ba98c09f30dedd29237b43dad5c7b5f", "7f590fd6d318248a486~ ## $ property_id <dbl> 2668, 4656, 4563, 4088, 2188, 4171, 2907, 5141, 1696, 1901, 2188~ ## $ room_nights <dbl> 4, 5, 6, 7, 4, 2, 4, 4, 1, 7, 1, 9, 6, 4, 5, 5, 2, 2, 2, 2, 1, 3~ ## $ price_per_night <dbl> 91.46696, 106.50500, 86.99137, 92.36562, 104.83894, 109.98188, 1~ ## $ checkin_day <chr> "mon", "tue", "wed", "fri", "tue", "fri", "fri", "wed", "wed", "~ ## $ for_business <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F~ ## $ status <chr> "cancelled", "cancelled", "stayed", "stayed", "stayed", "cancell~ ## $ review_score <dbl> NA, NA, 6.258123, 5.953598, 6.434745, NA, 7.599461, NA, 6.972698~ ``` --- ## Select columns with `select()` 👆 .panelset[ .panel[.panel-name[Select one column] .pull-left70[ Select the column `review_score`. ```r select(bookings, review_score) ``` ``` ## # A tibble: 10,000 x 1 ## review_score ## <dbl> ## 1 NA ## 2 NA ## 3 6.26 ## 4 5.95 ## 5 6.43 ## 6 NA ## 7 7.60 ## 8 NA ## 9 6.97 ## 10 NA ## # ... with 9,990 more rows ``` ] .pull-right30[ <img src="figures//03-dplyr_select.png" width="100%" /> ] .content-box-blue[ The output is **always** a `data.frame` or `tibble`, regardless of whether a single or multiple columns are selected. ] ] .panel[.panel-name[Select multiple columns] .pull-left70[ Select multiple columns by specifying column names as additional arguments. ```r select(bookings, review_score, status) ``` ``` ## # A tibble: 10,000 x 2 ## review_score status ## <dbl> <chr> ## 1 NA cancelled ## 2 NA cancelled ## 3 6.26 stayed ## 4 5.95 stayed ## 5 6.43 stayed ## 6 NA cancelled ## 7 7.60 stayed ## 8 NA cancelled ## 9 6.97 stayed ## 10 NA cancelled ## # ... with 9,990 more rows ``` ] .pull-right30[ <img src="figures//03-dplyr_select.png" width="100%" /> ] ] .panel[.panel-name[Exclude columns] .pull-left70[ Negative selection: select all but specific columns using `-`. ```r select(bookings, -booker_id) ``` ``` ## # A tibble: 10,000 x 7 ## property_id room_nights price_per_night checkin_day for_business status review_score ## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <dbl> ## 1 2668 4 91.5 mon FALSE cancelled NA ## 2 4656 5 107. tue FALSE cancelled NA ## 3 4563 6 87.0 wed FALSE stayed 6.26 ## 4 4088 7 92.4 fri FALSE stayed 5.95 ## 5 2188 4 105. tue FALSE stayed 6.43 ## 6 4171 2 110. fri FALSE cancelled NA ## 7 2907 4 116. fri FALSE stayed 7.60 ## 8 5141 4 111. wed FALSE cancelled NA ## 9 1696 1 106. wed FALSE stayed 6.97 ## 10 1901 7 82.3 sat FALSE cancelled NA ## # ... with 9,990 more rows ``` ] .pull-right30[ <img src="figures//03-dplyr_select.png" width="100%" /> ] ] ] --- ## Helper functions for `select()` **Helper functions** facilitate selecting multiple columns whose names satisfy a specific criterion. .pull-left60[ Helper function | Description :------------------------- | :---------------------------- `contains("abc")` | Names containing `"abc"` `starts_with("abc")` | Names starting with `"abc"` `ends_with("abc")` | Names ending with `"abc"` `num_range("a", 1:3)` | Names matching the numerical range `a1`, `a2`, `a3` `any_of(c("ab", "c"))` | Any of the names within a character vector `all_of(c("ab", "c"))` | All of the names within a character vector `matches(".id")` | Names matching the regular expr. `".id"` `everything()` | All (remaining) names `last_col()` | Last column ] --- ## Examples of `select()` with helper functions <!-- .content-box-purple[ --> <!-- ] --> .panelset[ .panel[.panel-name[Combined helpers] > _"Select all columns whose names either contain 'id' or end with 'night'."_ ```r select(bookings, contains("id"), ends_with("night")) ``` ``` ## # A tibble: 10,000 x 3 ## booker_id property_id price_per_night ## <chr> <dbl> <dbl> ## 1 215934017ba98c09f30dedd29237b43dad5c7b5f 2668 91.5 ## 2 7f590fd6d318248a48665f7f7db529aca40c84f5 4656 107. ## 3 10f0f138e8bb1015d3928f2b7d828cbb50cd0804 4563 87.0 ## 4 7b55021a4160dde65e31963fa55a096535bcad17 4088 92.4 ## 5 6694a79d158c7818cd63831b71bac91286db5aff 2188 105. ## 6 d0358740d5f15e85523f94ab8219f25d8c017347 4171 110. ## 7 944e568a0b511b9140bcc7c7c80c68c3edd3a86f 2907 116. ## 8 95476c2ef6bb9e3c227b46f1283310737fa13b7a 5141 111. ## 9 df235631a4c281c01e007b6a4c364e07f5843994 1696 106. ## 10 ff610140227d40d27daa01f0dcd5d64ea82e833d 1901 82.3 ## # ... with 9,990 more rows ``` ] .panel[.panel-name[`any_of()`] > _"Select any of the two columns `room_nights`, `this_column_is_not_there`."_ ```r select(bookings, any_of(c("room_nights", "this_column_is_not_there"))) ``` ``` ## # A tibble: 10,000 x 1 ## room_nights ## <dbl> ## 1 4 ## 2 5 ## 3 6 ## 4 7 ## 5 4 ## 6 2 ## 7 4 ## 8 4 ## 9 1 ## 10 7 ## # ... with 9,990 more rows ``` ] .panel[.panel-name[`all_of()`] > _"Select both columns `room_nights`, `this_column_is_not_there`."_ ```r select(bookings, all_of(c("room_nights", "this_column_is_not_there"))) ``` ``` ## Error: Can't subset columns that don't exist. ## x Column `this_column_is_not_there` doesn't exist. ``` ] ] --- ## Filter rows with `filter()` .pull-left70[ Filter bookings which stayed, i.e., didn't cancel. ```r filter(bookings, status == "stayed") ``` ``` ## # A tibble: 7,775 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 2 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 3 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 4 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 5 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 6 aeff2e2d3d23b1~ 2188 1 109. wed FALSE stayed ## 7 5a1442f4c7237e~ 2307 9 84.2 sat FALSE stayed ## 8 39804a2e3fb2e4~ 2907 6 112. sun FALSE stayed ## 9 e150e559405ef2~ 2870 4 127. sat FALSE stayed ## 10 4e9c7c21dfcf27~ 1674 5 102. sun FALSE stayed ## # ... with 7,765 more rows, and 1 more variable: review_score <dbl> ``` ```r # the same as: filter(bookings, status != "cancelled") ``` ] .pull-right30[ <img src="figures//03-dplyr_filter.png" width="100%" /> ] --- ## Filter rows with `filter()` .pull-left70[ Combine multiple conditions with `&`. ```r filter( bookings, status == "stayed" & !is.na(review_score) & between(price_per_night, 90, 120) ) ``` ``` ## # A tibble: 3,032 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 2 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 3 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 4 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 5 39804a2e3fb2e4~ 2907 6 112. sun FALSE stayed ## 6 4e9c7c21dfcf27~ 1674 5 102. sun FALSE stayed ## 7 8537874da3fa74~ 4420 2 106. sat FALSE stayed ## 8 0fcce2f5be8e34~ 1951 3 110. wed FALSE stayed ## 9 9864af3aba5bcb~ 1981 3 96.9 sun FALSE stayed ## 10 ccd24f993579b4~ 2307 3 115. fri FALSE stayed ## # ... with 3,022 more rows, and 1 more variable: review_score <dbl> ``` ] .pull-right30[ <img src="figures//03-dplyr_filter.png" width="100%" /> ] --- exclude: true class: middle, exercise-blue ## Your turn 1 Create a new tibble `x` that - only contains the columns `room_nights` and `review_score`, and - only contains the bookings with a price per night of less than 80. 1. _...using base `R` functions only._ 2. _...using the `dplyr` functions `select()` and `filter()`._
03
:
00
--- ## Sort rows with `arrange()` ```r arrange(bookings, price_per_night) # default: sort in ascending order ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 028c395e745c41~ 3096 1 39.4 tue FALSE stayed ## 2 037bec01b4f11d~ 3619 1 45.8 wed TRUE stayed ## 3 7341b69d8de545~ 3983 1 47.7 thu TRUE stayed ## 4 cbedb290616cb7~ 1172 2 53.2 thu TRUE stayed ## 5 dff926426fdf09~ 4354 6 54.2 fri FALSE stayed ## 6 b28ca9bd608e46~ 3118 2 54.8 sun TRUE stayed ## 7 e2aff0e6df1abe~ 4116 5 55.7 tue FALSE stayed ## 8 28421366c10d89~ 4391 4 56.6 fri FALSE stayed ## 9 a07a0a8a9d8924~ 4150 2 57.7 sat FALSE cance~ ## 10 97a4a1dcdce1ab~ 1172 2 61.0 tue TRUE stayed ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` ```r arrange(bookings, desc(price_per_night)) ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 3d62487f31468b~ 4931 1 273. tue TRUE cance~ ## 2 dc6add0d9538bf~ 2305 1 272. fri TRUE stayed ## 3 c57c799c4fa86c~ 5034 1 263. tue TRUE stayed ## 4 0dbc70d94f448b~ 1819 1 259. fri TRUE cance~ ## 5 1d0d33ccf02e4e~ 2306 1 257. fri FALSE cance~ ## 6 6b70130a6cd8bb~ 3916 1 256. tue TRUE stayed ## 7 c478924be4be9c~ 2464 1 255. sun FALSE stayed ## 8 52779fc42be8ef~ 3832 1 255. tue TRUE stayed ## 9 23a1258e825c9f~ 1971 1 254. tue TRUE stayed ## 10 58a1a0debacb2e~ 2888 1 252. sun FALSE stayed ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` --- ## Select rows by position with `slice()` ```r x <- arrange(bookings, desc(review_score)) slice(x, 1:10) ``` ``` ## # A tibble: 10 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 51c3e2e855240f~ 1970 3 126. fri FALSE stayed ## 2 99f91abc431bad~ 2175 1 189. wed FALSE stayed ## 3 1d7930a998f002~ 3089 6 77.4 thu FALSE stayed ## 4 58c94e9418fe51~ 3089 5 91.5 sat FALSE stayed ## 5 164c4797e41a32~ 3890 1 232. sat TRUE stayed ## 6 3119d16c994efc~ 3089 6 93.4 fri FALSE stayed ## 7 10f0f138e8bb10~ 3089 1 175. sun FALSE stayed ## 8 714d7f95f1766d~ 1855 2 129. thu FALSE stayed ## 9 b4d77fe4ea2472~ 3089 2 83.3 sat FALSE stayed ## 10 68686391598302~ 3089 5 106. fri FALSE stayed ## # ... with 1 more variable: review_score <dbl> ``` --- ## Chaining multiple operations ⛓️ `dplyr` code is very intuitive and expressive, but also quite verbose. 🤔 _"How to organize the code so that it remains readable even for a high number of operations?"_ -- > Example: For all bookers who didn't cancel their trip and checked in on a friday, we are interested in the 5 highest review scores. Only the columns `price_per_night` and `review_score` should be included in a new tibble which is sorted by `review_score` in descending order. --- ## Chaining multiple operations ⛓️ .panelset[ .panel[.panel-name[Nesting] Solution 1 (nesting): .pull-left70[ ```r select( slice( arrange( filter( bookings, !status == "cancelled" & checkin_day == "fri"), desc(review_score) ), 1:5 ), price_per_night, review_score ) ``` ``` ## # A tibble: 5 x 2 ## price_per_night review_score ## <dbl> <dbl> ## 1 126. 10 ## 2 93.4 9.80 ## 3 106. 9.77 ## 4 148. 9.64 ## 5 225. 9.49 ``` ] .pull-right30[ <img src="figures//03-Russian-Matroshka2.jpg" width="639" /> ] ] .panel[.panel-name[Save & overwrite intermediate results] Solution 2 (save and overwrite intermediate results): ```r x <- filter(bookings, !status == "cancelled" & checkin_day == "fri") x <- arrange(x, review_score) x <- slice(x, 1:5) x <- select(x, price_per_night, review_score) x ``` ``` ## # A tibble: 5 x 2 ## price_per_night review_score ## <dbl> <dbl> ## 1 123. 2.79 ## 2 109. 3.31 ## 3 127. 3.75 ## 4 111. 4.09 ## 5 79.1 4.15 ``` ] .panel[.panel-name[Save result of each operation as new object] Solution 3 (save each intermediate results as new object): ```r x1 <- filter(bookings, !status == "cancelled" & checkin_day == "fri") x2 <- arrange(x1, desc(review_score)) x3 <- slice(x2, 1:5) x4 <- select(x2, price_per_night, review_score) x4 ``` _There is a mistake in the above code. Can you spot it?_ ] ] ??? S1: - could you easily read the code after half a year from now? - arrange is the last operation but is read first S2: - very hard to spot mistakes you introduced in one of the earlier steps - interested in the 5 highest review scores: desc(review_score) S3: - 4th line -> ~x2~ → x3 - copy-and-paste problems: even if you know you make these mistakes you still will make the mistakes - naming is difficult: what happens if I want to insert another operations later between x1 and x2? --- name: pipe ## The pipe operator `%>%` .pull-left70[ - The **pipe operator** `%>%` is used to pass information from one operation to the next. - Its main purpose is to **express a sequence of operations**. - Using the pipe, a function's output becomes the first argument of the subsequent function. - Thus, the pipe operator helps to write **code that is easy to read and understand**. ```r bookings %>% filter(!status == "cancelled" & checkin_day == "fri") %>% arrange(desc(review_score)) %>% slice(1:5) %>% select(price_per_night, review_score) ``` ] .pull-right30[ <img src="figures//03-magrittr.png" width="90%" /> ] > "From table `bookings`, filter all non-canceled bookings with check-in on a Friday > THEN sort by review score in descending order > THEN take the top-5 bookings > THEN return price per night and review score." .font80[ - In RStudio, the shortcut for `%>%` is **Ctrl+⇧Shift+M**. - Although the pipe operator is implemented in the package `magrittr`, we do not need to load this package explicitly when we have loaded `tidyverse`. ] ??? - `%>%`: chain a new dplyr operation - `+`: add a new layer to a ggplot2 plot --- exclude: true class: middle, exercise-blue ## Your turn 2
05
:
00
Use `%>%` to write a pipeline which extracts all bookings with Friday as check-in day while only returning the columns `property_id` and `status`. --- ## Adding new columns with `mutate()` ➕ Create a new column for the **total price** of a booking, which is the product of `price_per_night` and `room_nights`. .panelset[ .panel[.panel-name[`mutate()`] ```r bookings %>% mutate(total_price = price_per_night * room_nights) ``` ``` ## # A tibble: 10,000 x 9 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 2 more variables: review_score <dbl>, total_price <dbl> ``` .content-box-yellow[ Note that a new variable will be appended as the last column of the data frame. ] ] .panel[.panel-name[`mutate() + select()`] (Only show `price_per_night`, `room_nights`, `total_price`) ```r bookings %>% mutate(total_price = price_per_night * room_nights) %>% select(price_per_night, room_nights, total_price) ``` ``` ## # A tibble: 10,000 x 3 ## price_per_night room_nights total_price ## <dbl> <dbl> <dbl> ## 1 91.5 4 366. ## 2 107. 5 533. ## 3 87.0 6 522. ## 4 92.4 7 647. ## 5 105. 4 419. ## 6 110. 2 220. ## 7 116. 4 465. ## 8 111. 4 446. ## 9 106. 1 106. ## 10 82.3 7 576. ## # ... with 9,990 more rows ``` ] .panel[.panel-name[`mutate(..., .before = ...)`] (Insert the new column before `booker_id`.) ```r bookings %>% mutate(total_price = price_per_night * room_nights, .before = booker_id) ``` ``` ## # A tibble: 10,000 x 9 ## total_price booker_id property_id room_nights price_per_night checkin_day for_business ## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <lgl> ## 1 366. 215934017~ 2668 4 91.5 mon FALSE ## 2 533. 7f590fd6d~ 4656 5 107. tue FALSE ## 3 522. 10f0f138e~ 4563 6 87.0 wed FALSE ## 4 647. 7b55021a4~ 4088 7 92.4 fri FALSE ## 5 419. 6694a79d1~ 2188 4 105. tue FALSE ## 6 220. d0358740d~ 4171 2 110. fri FALSE ## 7 465. 944e568a0~ 2907 4 116. fri FALSE ## 8 446. 95476c2ef~ 5141 4 111. wed FALSE ## 9 106. df235631a~ 1696 1 106. wed FALSE ## 10 576. ff6101402~ 1901 7 82.3 sat FALSE ## # ... with 9,990 more rows, and 2 more variables: status <chr>, review_score <dbl> ``` ] .panel[.panel-name[`mutate(..., .after = ...)`] (Insert the new column after the second column (`property_id`).) ```r bookings %>% mutate(total_price = price_per_night * room_nights, .after = 2) ``` ``` ## # A tibble: 10,000 x 9 ## booker_id property_id total_price room_nights price_per_night checkin_day for_business ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <lgl> ## 1 215934017~ 2668 366. 4 91.5 mon FALSE ## 2 7f590fd6d~ 4656 533. 5 107. tue FALSE ## 3 10f0f138e~ 4563 522. 6 87.0 wed FALSE ## 4 7b55021a4~ 4088 647. 7 92.4 fri FALSE ## 5 6694a79d1~ 2188 419. 4 105. tue FALSE ## 6 d0358740d~ 4171 220. 2 110. fri FALSE ## 7 944e568a0~ 2907 465. 4 116. fri FALSE ## 8 95476c2ef~ 5141 446. 4 111. wed FALSE ## 9 df235631a~ 1696 106. 1 106. wed FALSE ## 10 ff6101402~ 1901 576. 7 82.3 sat FALSE ## # ... with 9,990 more rows, and 2 more variables: status <chr>, review_score <dbl> ``` ] ] ??? - Syntax: mutate(name of new variable = equation) --- ## Replacing existing columns with `mutate()` ➕ Convert the column `property_id` from `character` into `factor`. ```r bookings %>% mutate(property_id = as.factor(property_id)) ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <fct> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` --- exclude: true class: middle, exercise-blue ## Your turn 3
03
:
00
Write a pipeline which calculates the mean-centered `price_per_night` as new column names `centered_ppn`. The returned data frame should only contain this new column. _You need `mutate()` and `select()`._ --- ## `mutate(across(...))` Apply a transformation to **multiple columns**. .panelset[ .panel[.panel-name[Select columns by name] > Transform columns whose names end with "id" to factor variables. ```r bookings %>% mutate(across(ends_with("id"), as.factor)) ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <fct> <fct> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` ] .panel[.panel-name[Select columns by logical condition] > Transform all columns of type `character` to factor variables. ```r bookings %>% mutate(across(where(is.character), as.factor)) ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <fct> <dbl> <dbl> <dbl> <fct> <lgl> <fct> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` ] <!-- .panel[.panel-name[]] --> ] <!-- https://www.youtube.com/watch?v=0229Uq2hkJo&list=PLNUVZZ6hfXX1tyUykCWShOKZdIB0TIhtM&index=18 --> ??? - `mutate_at()`: select one or more columns to apply a function on with `vars()` helper function - `mutate_if()`: apply a function to columns that fulfill a condition - `mutate_all()`: apply a function to all columns --- exclude: true ## `mutate_at/if/all()` ```r # Create new columns with suffix '_chr' bookings %>% mutate_at(vars(ends_with("id")), list(chr = as.character)) # applies to booker_id and property_id ``` ``` ## # A tibble: 10,000 x 10 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 3 more variables: review_score <dbl>, ## # booker_id_chr <chr>, property_id_chr <chr> ``` --- exclude: true ## `mutate_at/if/all()` ```r # Convert all character variables to factors bookings %>% mutate_if(is.character, as.factor) # applies to booker_id, checkin_day and status ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <fct> <dbl> <dbl> <dbl> <fct> <lgl> <fct> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` ??? - place holder: the placeholder `.` is the dynamic variable when using `mutate_at()` and other functions in the tidyverse. --- ## Summarize many rows with `summarize()` 🗜 .pull-left70[ `summarize()` (or `summarise()`) performs some kind of aggregation and returns a summary table with fewer rows and removes all columns that are irrelevant to the calculation. ] .pull-right30[ <img src="figures//03-dplyr_summarize.png" width="100%" /> ] .panelset[ .panel[.panel-name[A single one-valued stat] > What is the average review score over all bookings? ```r bookings %>% summarize(review_score = mean(review_score, na.rm = TRUE)) ``` ``` ## # A tibble: 1 x 1 ## review_score ## <dbl> ## 1 7.22 ``` (one row, one column) ] .panel[.panel-name[Multiple one-valued stats] > What is the total number of bookings, the number of bookings without review score, and the average review score over all bookings? ```r bookings %>% summarize(n = n(), # Total no. of bookings n_miss = sum(is.na(review_score)), # No. of bookings w/o review score review_score = mean(review_score, na.rm = TRUE)) # Avg. review score ``` ``` ## # A tibble: 1 x 3 ## n n_miss review_score ## <int> <int> <dbl> ## 1 10000 3817 7.22 ``` (one row, multiple columns) ] .panel[.panel-name[Multiple n-valued stats] > What is the price range over all bookings? ```r bookings %>% summarize(statistic = c("min", "max"), value = range(price_per_night)) ``` ``` ## # A tibble: 2 x 2 ## statistic value ## <chr> <dbl> ## 1 min 39.4 ## 2 max 273. ``` (multiple rows, multiple columns) ] ] --- exclude: true class: middle, exercise-blue ## Your turn 4
08
:
00
Obtain a one-row summary tibble of the bookings data containing the following statistics: - The number of rows (Hint: use the function `n()`) - The number of `"stayed"` bookings - The mean of the _total price_ (Hint: you have to compute the total price first) The output tibble should contain three columns and just one row. --- ## Grouping with `group_by()` `group_by()` lets us perform operations for each _group_ separately. ```r bookings %>% * group_by(for_business) %>% summarize(n = n(), review_avg = mean(review_score, na.rm = TRUE)) ``` ``` ## # A tibble: 2 x 3 ## for_business n review_avg ## <lgl> <int> <dbl> ## 1 FALSE 6285 7.50 ## 2 TRUE 3715 6.85 ``` .footnote[ .content-box-gray[ Use `ungroup()` to undo the grouping. ] ] -- ```r class(bookings) ``` ``` ## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame" ``` ```r bookings %>% group_by(for_business) %>% class() ``` ``` ## [1] "grouped_df" "tbl_df" "tbl" "data.frame" ``` --- exclude: true class: middle, exercise-blue ## Your turn 5
10
:
00
Write a pipeline that yields a tibble containing only the most expensive booking of each property. Return only the property id and the price of the most expensive booking. You should get as many rows as there are unique properties: ```r n_distinct(bookings$property_id) # convenience function for `length(unique(bookings$property_id))` ``` ``` ## [1] 4178 ``` 1. Calculate `price_total` using `mutate()`. 1. Group by `property_id` using `group_by()`. 1. Filter rows where the total price is equal to the maximum of the total price (for that property using `filter()`). 1. Select the columns `property_id` and `price_total` using `select()`. --- ## Group by multiple columns ```r # Average score by travel type and check-in day of the week bookings %>% group_by(for_business, checkin_day) %>% summarize(mean_review = mean(review_score, na.rm = TRUE)) ``` ``` ## # A tibble: 14 x 3 ## # Groups: for_business [2] ## for_business checkin_day mean_review ## <lgl> <chr> <dbl> ## 1 FALSE fri 7.60 ## 2 FALSE mon 7.34 ## 3 FALSE sat 7.57 ## 4 FALSE sun 7.57 ## 5 FALSE thu 7.42 ## 6 FALSE tue 7.32 ## 7 FALSE wed 7.43 ## 8 TRUE fri 7.04 ## 9 TRUE mon 6.87 ## 10 TRUE sat 7.05 ## 11 TRUE sun 6.93 ## 12 TRUE thu 6.79 ## 13 TRUE tue 6.75 ## 14 TRUE wed 6.84 ``` --- ## Count the number of rows per group with `count()` Syntax: ```r bookings %>% count(x, y, ...) ``` ...is a shortcut for... ```r bookings %>% group_by(x, y, ...) %>% summarize(n = n()) ``` -- .pull-left[ ```r bookings %>% count(for_business, status) ``` ``` ## # A tibble: 4 x 3 ## for_business status n ## <lgl> <chr> <int> ## 1 FALSE cancelled 1762 ## 2 FALSE stayed 4523 ## 3 TRUE cancelled 463 ## 4 TRUE stayed 3252 ``` ] .pull-right[ ```r bookings %>% group_by(for_business, status) %>% summarize(n = n()) ``` ``` ## # A tibble: 4 x 3 ## # Groups: for_business [2] ## for_business status n ## <lgl> <chr> <int> ## 1 FALSE cancelled 1762 ## 2 FALSE stayed 4523 ## 3 TRUE cancelled 463 ## 4 TRUE stayed 3252 ``` ] --- name: join class: center, middle, inverse ## Combining multiple data frames --- ## Customer bookings data > What is the number of bookings per destination (city)? To answer this question, we have to combine the two data frames `bookings` and `properties`. .panelset[ .panel[.panel-name[`bookings`] ```r bookings <- read_csv("../data/bookings.csv") ``` ``` ## # A tibble: 10,000 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d31824~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb10~ 4563 6 87.0 wed FALSE stayed ## 4 7b55021a4160dd~ 4088 7 92.4 fri FALSE stayed ## 5 6694a79d158c78~ 2188 4 105. tue FALSE stayed ## 6 d0358740d5f15e~ 4171 2 110. fri FALSE cance~ ## 7 944e568a0b511b~ 2907 4 116. fri FALSE stayed ## 8 95476c2ef6bb9e~ 5141 4 111. wed FALSE cance~ ## 9 df235631a4c281~ 1696 1 106. wed FALSE stayed ## 10 ff610140227d40~ 1901 7 82.3 sat FALSE cance~ ## # ... with 9,990 more rows, and 1 more variable: review_score <dbl> ``` ] .panel[.panel-name[`properties`] ```r properties <- read_csv("../data/properties.csv") properties ``` ``` ## # A tibble: 4,178 x 5 ## property_id destination property_type nr_rooms facilities ## <dbl> <chr> <chr> <dbl> <chr> ## 1 2668 Brisbane Hotel 32 airport shuttle,free wifi,garden,breakf~ ## 2 4656 Brisbane Hotel 39 on-site restaurant,pool,airport shuttle~ ## 3 4563 Brisbane Apartment 9 laundry ## 4 4088 Brisbane Apartment 9 kitchen,laundry,free wifi ## 5 2188 Brisbane Apartment 4 parking,kitchen,bbq,free wifi,game cons~ ## 6 4171 Brisbane Apartment 5 kitchen,pool,laundry,parking,free wifi,~ ## 7 2907 Brisbane Hotel 22 airport shuttle,on-site restaurant,brea~ ## 8 5141 Brisbane Hotel 20 breakfast,free wifi,on-site restaurant,~ ## 9 1696 Brisbane Apartment 5 free wifi,laundry,pool,game console,par~ ## 10 1901 Brisbane Apartment 11 free wifi,bbq,laundry,breakfast,pool,pa~ ## # ... with 4,168 more rows ``` ] <!-- .panel[.panel-name[]] --> <!-- .panel[.panel-name[]] --> ] --- ## Joining data frames .panelset[ .panel[.panel-name[Data] Consider the following two toy data frames `x` and `y`: <!-- > A mutating join allows you to combine variables from two tables. --> <!-- > It first matches observations by their keys, then copies across variables --> <!-- > from one table to the other. -- [R for Data Science: Mutating joins](http://r4ds.had.co.nz/relational-data.html#mutating-joins) --> .pull-left60[ ```r (x <- tibble(id = c(1L, 2L, 3L), x = c("x1", "x2", "x3"))) ``` ``` ## # A tibble: 3 x 2 ## id x ## <int> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ```r (y <- tibble(id = c(1L, 2L, 4L), y = c("y1", "y2", "y4"))) ``` ``` ## # A tibble: 3 x 2 ## id y ## <int> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] .pull-right40[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> ] .font80[Source of the figures on this and the following slides: <https://github.com/gadenbuie/tidyexplain>] ] .panel[.panel-name[Basic syntax] Basic syntax: ```r ***_join(df1, df2, by = "<ID>") ``` ] .panel[.panel-name[Inner] .pull-left70[ Join two data frames `x` and `y`. The result is a data frame containing all rows from `x` with matching values in `y` for column `id` and all columns from `x` and `y`. ```r inner_join(x, y, by = "id") ``` ``` ## # A tibble: 2 x 3 ## id x y ## <int> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-inner-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Left] .pull-left70[ Join two data frames `x` and `y`. The result is a data frame containing all rows from `x` and all columns from `x` and `y`. In rows where there are no matching values in `y` for column `id`, the values of the columns that are present only in `y` are set to `NA`. ```r left_join(x, y, by = "id") ``` ``` ## # A tibble: 3 x 3 ## id x y ## <int> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-left-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Left 2] .pull-left70[ In case of multiple matches, **all** combinations of the matches are returned: ```r (y_extra <- bind_rows(y, tibble(id = 2L, y = "y5"))) ``` ``` ## # A tibble: 4 x 2 ## id y ## <int> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ## 4 2 y5 ``` ```r left_join(x, y_extra, by = "id") ``` ``` ## # A tibble: 4 x 3 ## id x y ## <int> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 2 x2 y5 ## 4 3 x3 <NA> ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-left-join-extra.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Right] .pull-left70[ Join two data frames `x` and `y`. The result is a data frame containing all rows from `y` and all columns from `x` and `y`. In rows where there are no matching values in `x` for column `id`, the values of the columns that are present only in `x` are set to `NA`. ```r right_join(x, y, by = "id") ``` ``` ## # A tibble: 3 x 3 ## id x y ## <int> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-right-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Full] .pull-left70[ Join two data frames `x` and `y`. The result is a data frame containing all rows and all columns from `x` and `y`. In rows where there are no matching values for column `id`, the values of the columns from the other data frame are set to `NA`. ```r full_join(x, y, by = "id") ``` ``` ## # A tibble: 4 x 3 ## id x y ## <int> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-full-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Semi] .pull-left70[ Return all rows and columns from `x` with matching rows in `y`. ```r semi_join(x, y, by = "id") ``` ``` ## # A tibble: 2 x 2 ## id x ## <int> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-semi-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] .panel[.panel-name[Anti] .pull-left70[ Return all rows and columns from `x` **without** matching rows in `y`. ```r anti_join(x, y, by = "id") ``` ``` ## # A tibble: 1 x 2 ## id x ## <int> <chr> ## 1 3 x3 ``` ] .pull-right30[ <img src="figures//03-join_x.png" width="100%" style="display: block; margin: auto;" /> <img src="figures//03-anti-join.gif" width="100%" style="display: block; margin: auto;" /> ] ] ] ??? - mutating join: combine variables from two tables matching observations based on an id variable - filtering join: match observations like in mutating join, but mergin will affect observations, not columns --- exclude: true class: middle, exercise-blue ## Your turn 6
03
:
00
Calculate the number of bookings per city. Hint: Join the data frames `bookings` and `properties` and _count_ the number of rows per city. --- exclude: true ## Filtering joins > Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. (...) Semi-joins are useful for matching filtered summary tables back to the original rows. (...) Anti-joins are useful for diagnosing join mismatches -- [R for Data Science: Filtering Joins](http://r4ds.had.co.nz/relational-data.html#filtering-joins) --- exclude: true ## Joining data frames <img src="figures//03-dplyr_combine.png" width="60%" style="display: block; margin: auto;" /> --- name: tidyr class: center, middle, inverse <img src="figures//03-tidyr.png" width="250px" /> --- class: bottom, right background-image: url("figures/03-kriminalitaetsatlas-berlin.png") background-size: contain .pull-right[ .content-box-gray[ ## Tidy and untidy data .font80[ [Data source](https://www.govdata.de/web/guest/daten/-/details/kriminalitatsatlas-berlin) ] ] ] --- ## Data come in different shapes... .panelset[ .panel[.panel-name[`table1`] ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] .panel[.panel-name[`table2`] ``` ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .panel[.panel-name[`table3`] ``` ## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ] .panel[.panel-name[`table4a` & `table4b`] ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] ] ??? - table1: each row contains information for the combination of country and year - table2: each row represents a combination of country, year and variable. The count column contains the values of cases and population in separate rows - table3: similar to table1, each row represents a combination of country and year, but instead of having separate columns for cases and population, they are stored as a string separated by a slash character - table 4: data is organized into two dfs; table4a contains data on cases, table4b on population; each row is a country, each column represents a year and the cells are the values of the table's variable for the combination of country and year --- ## Reshape data frames with [`tidyr`](https://tidyr.tidyverse.org/) .pull-left[ The `tidyverse` inherits its name from the term **tidy data**. Tidy data refers to a specific standardized dataset structure. Characteristics of tidy data: 1. Each **variable** must have its own column. 1. Each **observation** must have its own row. 1. Each **value** must have its own cell. Many of the tidyverse functions require a _tidy_ data frame input. The `tidyr` package contains functions to reshape "messy" into tidy data frames. ] .pull-right[ <img src="figures//03-tidyr_long_wide.png" width="100%" /> ] .footnote[ Further reading: - [R for Data Science: Tidy data](https://r4ds.had.co.nz/tidy-data.html) - Function overview: RStudio's [Data import cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf) ] ??? from r4ds: As a data scientist, you will encounter a lot of untidy data, eg., because it is easier to enter data into a spreadsheet in untidy way --- ## `pivot_longer()` and `pivot_wider()` `tidyr`'s two main functions are `pivot_longer()` and `pivot_wider()`. - `pivot_longer()` takes multiple columns and collapses them into **key-value pairs**. - `pivot_wider()` takes two or more columns (i.e., a key-value pair) and spreads them into **multiple columns**. <img src="figures//03-tidyr_pivot.png" width="45%" style="display: block; margin: auto;" /> --- ## `pivot_longer()` and `pivot_wider()` .content-box-blue[ .font90[ ```r (wide <- tibble(name = c("Alex","Ben","Cedric"), DataSciR = c(2.0,1.7,1.0), VisAnalytics = c(3.3,1.3,2.0))) ``` ``` ## # A tibble: 3 x 3 ## name DataSciR VisAnalytics ## <chr> <dbl> <dbl> ## 1 Alex 2 3.3 ## 2 Ben 1.7 1.3 ## 3 Cedric 1 2 ``` This data frame is in **wide** format: there are 3 variables (name of student, course name, grade), but only the student's name has its own column. ] ] -- .pull-left[ .font90[ Use `pivot_longer()` to collapse the two course columns into key-value pairs `course` and `grade`. ```r tidy <- wide %>% pivot_longer(cols = -name, names_to = "course", values_to = "grade") tidy ``` ``` ## # A tibble: 6 x 3 ## name course grade ## <chr> <chr> <dbl> ## 1 Alex DataSciR 2 ## 2 Alex VisAnalytics 3.3 ## 3 Ben DataSciR 1.7 ## 4 Ben VisAnalytics 1.3 ## 5 Cedric DataSciR 1 ## 6 Cedric VisAnalytics 2 ``` ] ] -- .pull-right[ .font90[ Use `pivot_wider()` to "unpivot" a data frame. It is useful when there are variables that form rows instead of columns. ```r tidy %>% pivot_wider(names_from = course, values_from = grade) ``` ``` ## # A tibble: 3 x 3 ## name DataSciR VisAnalytics ## <chr> <dbl> <dbl> ## 1 Alex 2 3.3 ## 2 Ben 1.7 1.3 ## 3 Cedric 1 2 ``` ] ] ??? - quotes for new / non-existing columns; no quotes when referring to existing colums --- ## Reshaping from "long" to "wide" .panelset[ .panel[.panel-name[Long data] .content-box-blue[ Compute the number of bookings per city and day: ```r day_order <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun") df <- bookings %>% inner_join(properties, by = "property_id") checkin_count <- df %>% count(destination, checkin_day) %>% mutate(checkin_day = factor(checkin_day, levels = day_order)) checkin_count ``` ``` ## # A tibble: 21 x 3 ## destination checkin_day n ## <chr> <fct> <int> ## 1 Amsterdam fri 1074 ## 2 Amsterdam mon 517 ## 3 Amsterdam sat 889 ## 4 Amsterdam sun 813 ## 5 Amsterdam thu 667 ## 6 Amsterdam tue 498 ## 7 Amsterdam wed 542 ## 8 Brisbane fri 162 ## 9 Brisbane mon 133 ## 10 Brisbane sat 114 ## # ... with 11 more rows ``` ] ] .panel[.panel-name[Wide data] 🤔 _"How can we create the following table from `checkin_count`?"_ ``` ## # A tibble: 3 x 8 ## destination mon tue wed thu fri sat sun ## <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 Amsterdam 517 498 542 667 1074 889 813 ## 2 Brisbane 133 148 128 162 162 114 153 ## 3 Tokyo 718 655 560 718 451 322 576 ``` ] .panel[.panel-name[Code] ```r checkin_count %>% pivot_wider(names_from = checkin_day, values_from = n) ``` ``` ## # A tibble: 3 x 8 ## destination fri mon sat sun thu tue wed ## <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 Amsterdam 1074 517 889 813 667 498 542 ## 2 Brisbane 162 133 114 153 162 148 128 ## 3 Tokyo 451 718 322 576 718 655 560 ``` `pivot_wider()` creates the new columns in the order the keys appear in the data. Hence, we can sort the rows by `checkin_day` to obtain a data frame with the day-of-week columns in the correct order. ] .panel[.panel-name[Wide data with correct order] Arrange by `checkin_day` before reshaping the data frame into wide format: ```r checkin_count %>% * arrange(checkin_day) %>% pivot_wider(names_from = checkin_day, values_from = n) ``` ``` ## # A tibble: 3 x 8 ## destination mon tue wed thu fri sat sun ## <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 Amsterdam 517 498 542 667 1074 889 813 ## 2 Brisbane 133 148 128 162 162 114 153 ## 3 Tokyo 718 655 560 718 451 322 576 ``` ] .panel[.panel-name[Undo] .pull-left[ Bring the data back into long format: ```r checkin_count %>% pivot_wider(names_from = checkin_day, values_from = n) %>% * pivot_longer(cols = c(mon, tue, wed, thu, * fri, sat, sun), names_to = "checkin_day", values_to = "n") ``` ``` ## # A tibble: 21 x 3 ## destination checkin_day n ## <chr> <chr> <int> ## 1 Amsterdam mon 517 ## 2 Amsterdam tue 498 ## 3 Amsterdam wed 542 ## 4 Amsterdam thu 667 ## 5 Amsterdam fri 1074 ## 6 Amsterdam sat 889 ## 7 Amsterdam sun 813 ## 8 Brisbane mon 133 ## 9 Brisbane tue 148 ## 10 Brisbane wed 128 ## # ... with 11 more rows ``` ] .pull-right[ Alternative variables selection: ```r checkin_count %>% pivot_wider(names_from = checkin_day, values_from = n) %>% * pivot_longer(cols = -destination, names_to = "checkin_day", values_to = "n") ``` ``` ## # A tibble: 21 x 3 ## destination checkin_day n ## <chr> <chr> <int> ## 1 Amsterdam fri 1074 ## 2 Amsterdam mon 517 ## 3 Amsterdam sat 889 ## 4 Amsterdam sun 813 ## 5 Amsterdam thu 667 ## 6 Amsterdam tue 498 ## 7 Amsterdam wed 542 ## 8 Brisbane fri 162 ## 9 Brisbane mon 133 ## 10 Brisbane sat 114 ## # ... with 11 more rows ``` ] ] ] ??? - join bookings and properties - count the number of bookings per city and check-in day of the week - convert checkin_day from character to factor variable --- exclude: true class: middle, exercise-blue ## Your turn 7
10
:
00
### Do properties appeal to business travellers and tourists alike? - Convert the column `for_business` to a factor with the levels `"business"` and `"tourist"`. - For each property, calculate the average review score given by business travelers and tourists. - Then, calculate the average review score difference between business travelers and tourists. --- ## `separate()` ```r properties %>% head() ``` ``` ## # A tibble: 6 x 5 ## property_id destination property_type nr_rooms facilities ## <dbl> <chr> <chr> <dbl> <chr> ## 1 2668 Brisbane Hotel 32 airport shuttle,free wifi,garden,breakfa~ ## 2 4656 Brisbane Hotel 39 on-site restaurant,pool,airport shuttle,~ ## 3 4563 Brisbane Apartment 9 laundry ## 4 4088 Brisbane Apartment 9 kitchen,laundry,free wifi ## 5 2188 Brisbane Apartment 4 parking,kitchen,bbq,free wifi,game conso~ ## 6 4171 Brisbane Apartment 5 kitchen,pool,laundry,parking,free wifi,g~ ``` Have a look at the `facilities` column. It indicates the availability of various facilities in the accommodation. --- ## `separate()` `separate()` splits up two or more variables that are clumped together in one column. ```r properties %>% # Split `facilities` by `,` into multiple columns. separate(facilities, into = paste0("facility_", 1:9), sep = ",") %>% head(5) ``` ``` ## Warning: Expected 9 pieces. Missing pieces filled with `NA` in 4091 rows [1, 2, 3, 4, 5, ## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...]. ``` ``` ## # A tibble: 5 x 13 ## property_id destination property_type nr_rooms facility_1 facility_2 facility_3 ## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 2668 Brisbane Hotel 32 airport shuttle free wifi garden ## 2 4656 Brisbane Hotel 39 on-site restaur~ pool airport shut~ ## 3 4563 Brisbane Apartment 9 laundry <NA> <NA> ## 4 4088 Brisbane Apartment 9 kitchen laundry free wifi ## 5 2188 Brisbane Apartment 4 parking kitchen bbq ## # ... with 6 more variables: facility_4 <chr>, facility_5 <chr>, facility_6 <chr>, ## # facility_7 <chr>, facility_8 <chr>, facility_9 <chr> ``` We get a warning because the number of facilities differ across properties. The maximum number of facilities is 9. If a property has less than 9 facilities, the remaining columns are filled with `NA`. -- .content-box-gray[ .font80[ ✅ `unite()` is the inverse function of `separate()` and combines multiple variables into one. For example, this operation could be useful for combining day, month and year columns into one date column. ] ] --- ## Nesting tables Common data structures are **hierarchical**, e.g. patient-centric data with repeat observations. **Nesting** allows to store collapsed data frames and simplifies data management. First, reshape `properties` into a tidy format: ```r tp <- properties %>% separate(facilities, into = paste0("facility_", 1:9), sep = ",") %>% pivot_longer(cols = starts_with("facility_"), names_to = "facility_nr", values_to = "facility") ``` ``` ## Warning: Expected 9 pieces. Missing pieces filled with `NA` in 4091 rows [1, 2, 3, 4, 5, ## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...]. ``` ```r head(tp) ``` ``` ## # A tibble: 6 x 6 ## property_id destination property_type nr_rooms facility_nr facility ## <dbl> <chr> <chr> <dbl> <chr> <chr> ## 1 2668 Brisbane Hotel 32 facility_1 airport shuttle ## 2 2668 Brisbane Hotel 32 facility_2 free wifi ## 3 2668 Brisbane Hotel 32 facility_3 garden ## 4 2668 Brisbane Hotel 32 facility_4 breakfast ## 5 2668 Brisbane Hotel 32 facility_5 pool ## 6 2668 Brisbane Hotel 32 facility_6 on-site restaurant ``` Now, the data frame contains a lot of duplicate information: 9x destination, property type and number of rooms for each property id. --- ## Nesting tables Use `nest()` to create a nested data frame. The nested data frame contains a column `facilities` which is a **list of data frames**: ```r nested_tp <- tp %>% nest(facilities = c(facility_nr, facility)) nested_tp ``` ``` ## # A tibble: 4,178 x 5 ## property_id destination property_type nr_rooms facilities ## <dbl> <chr> <chr> <dbl> <list> ## 1 2668 Brisbane Hotel 32 <tibble [9 x 2]> ## 2 4656 Brisbane Hotel 39 <tibble [9 x 2]> ## 3 4563 Brisbane Apartment 9 <tibble [9 x 2]> ## 4 4088 Brisbane Apartment 9 <tibble [9 x 2]> ## 5 2188 Brisbane Apartment 4 <tibble [9 x 2]> ## 6 4171 Brisbane Apartment 5 <tibble [9 x 2]> ## 7 2907 Brisbane Hotel 22 <tibble [9 x 2]> ## 8 5141 Brisbane Hotel 20 <tibble [9 x 2]> ## 9 1696 Brisbane Apartment 5 <tibble [9 x 2]> ## 10 1901 Brisbane Apartment 11 <tibble [9 x 2]> ## # ... with 4,168 more rows ``` .footnote[ .content-box-gray[ Use `unnest()` to expand a list column, such that each element of the list becomes a row. ] ] --- ## `drop_na()` Some of the rows in the nested `facilities` column contain `NA`'s. How can we remove them? .panelset[ .panel[.panel-name[Access nested data frame] ```r nested_tp$facilities[1] ``` ``` ## [[1]] ## # A tibble: 9 x 2 ## facility_nr facility ## <chr> <chr> ## 1 facility_1 airport shuttle ## 2 facility_2 free wifi ## 3 facility_3 garden ## 4 facility_4 breakfast ## 5 facility_5 pool ## 6 facility_6 on-site restaurant ## 7 facility_7 <NA> ## 8 facility_8 <NA> ## 9 facility_9 <NA> ``` ] .panel[.panel-name[`drop_na()` on list columns] The function `drop_na()` removes all rows with at least one missing value. ```r nested_tp_wo_na <- nested_tp %>% drop_na(facilities) # Did it work? nested_tp_wo_na$facilities[1] ``` ``` ## [[1]] ## # A tibble: 9 x 2 ## facility_nr facility ## <chr> <chr> ## 1 facility_1 airport shuttle ## 2 facility_2 free wifi ## 3 facility_3 garden ## 4 facility_4 breakfast ## 5 facility_5 pool ## 6 facility_6 on-site restaurant ## 7 facility_7 <NA> ## 8 facility_8 <NA> ## 9 facility_9 <NA> ``` ] .panel[.panel-name[`drop_na()` before nesting] `dplyr`/`tidyr` functions don't work recursively. Thus, we apply `drop_na()` before nesting. ```r nested_without_missing <- tp %>% * drop_na() %>% nest(facilities = c(facility_nr, facility)) nested_without_missing$facilities[[1]] ``` ``` ## # A tibble: 6 x 2 ## facility_nr facility ## <chr> <chr> ## 1 facility_1 airport shuttle ## 2 facility_2 free wifi ## 3 facility_3 garden ## 4 facility_4 breakfast ## 5 facility_5 pool ## 6 facility_6 on-site restaurant ``` ] ] --- exclude: true ## Bonus: `drop_na()` Be aware that most `dplyr`/`tidyr` functions cannot deal with list columns. However, we can leverage the `map()` function from the `purrr` package to apply the `drop_na()` function on each element of `facilities`. ```r nested_tp_wo_na <- nested_tp %>% mutate(facilities = map(facilities, drop_na)) nested_tp_wo_na$facilities[1] ``` ``` ## [[1]] ## # A tibble: 6 x 2 ## facility_nr facility ## <chr> <chr> ## 1 facility_1 airport shuttle ## 2 facility_2 free wifi ## 3 facility_3 garden ## 4 facility_4 breakfast ## 5 facility_5 pool ## 6 facility_6 on-site restaurant ``` --- class: center, middle, inverse name: readr <img src="figures//03-readr.png" width="250px" /> --- ## Data import .pull-left70[ The first step of every data analysis project is to import one or more datasets. The tidyverse provides seven packages for the import of various data formats: - `readr`: flat files (.csv, .tsv, ...) - `DBI`: databases (SQLite, MySQL, PostgreSQL, MonetDB, ...) - `haven`: foreign statistical formats (.sas, .sav, .dta) - `jsonlite`: json files - `readxl`: Excel files (.xls, .xlsx) - `rvest`: websites (.html) - `xml2`: xml files ] .pull-right30[ <img src="figures//03-readr.png" width="100%" /> ] .footnote[ .content-box-purple[ Function reference: RStudio's [Data import cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf) ] ] --- ## Data import: `bookings.csv` and `properties.csv` .panelset[ .panel[.panel-name[`bookings.csv`] ```r # library(readr) # readr is a core tidyverse package and hence doesn't need to be loaded separately. bookings <- read_csv("../data/bookings.csv") ``` ``` ## ## -- Column specification ------------------------------------------------------------------ ## cols( ## booker_id = col_character(), ## property_id = col_double(), ## room_nights = col_double(), ## price_per_night = col_double(), ## checkin_day = col_character(), ## for_business = col_logical(), ## status = col_character(), ## review_score = col_double() ## ) ``` ```r head(bookings, 3) ``` ``` ## # A tibble: 3 x 8 ## booker_id property_id room_nights price_per_night checkin_day for_business status ## <chr> <dbl> <dbl> <dbl> <chr> <lgl> <chr> ## 1 215934017ba98c0~ 2668 4 91.5 mon FALSE cance~ ## 2 7f590fd6d318248~ 4656 5 107. tue FALSE cance~ ## 3 10f0f138e8bb101~ 4563 6 87.0 wed FALSE stayed ## # ... with 1 more variable: review_score <dbl> ``` ] .panel[.panel-name[`properties.csv`] ```r properties <- read_csv("../data/properties.csv") properties ``` ``` ## # A tibble: 4,178 x 5 ## property_id destination property_type nr_rooms facilities ## <dbl> <chr> <chr> <dbl> <chr> ## 1 2668 Brisbane Hotel 32 airport shuttle,free wifi,garden,breakf~ ## 2 4656 Brisbane Hotel 39 on-site restaurant,pool,airport shuttle~ ## 3 4563 Brisbane Apartment 9 laundry ## 4 4088 Brisbane Apartment 9 kitchen,laundry,free wifi ## 5 2188 Brisbane Apartment 4 parking,kitchen,bbq,free wifi,game cons~ ## 6 4171 Brisbane Apartment 5 kitchen,pool,laundry,parking,free wifi,~ ## 7 2907 Brisbane Hotel 22 airport shuttle,on-site restaurant,brea~ ## 8 5141 Brisbane Hotel 20 breakfast,free wifi,on-site restaurant,~ ## 9 1696 Brisbane Apartment 5 free wifi,laundry,pool,game console,par~ ## 10 1901 Brisbane Apartment 11 free wifi,bbq,laundry,breakfast,pool,pa~ ## # ... with 4,168 more rows ``` ] ] --- ## Comparison with Base R import functions Base `R` already has functions for loading flat files, e.g. `read.csv()`, `read.delim()`. Advantages of the Tidyverse implementations include: - higher speed - ~~characters are not coerced to factors by default (see `stringsAsFactors` argument)~~ (This is not the case anymore since `R` version 4.0) - generates tibbles instead of data frames --- name: tibble class: center, middle, inverse <img src="figures//03-tibble.png" width="250px" /> --- ## [Tibbles](https://tibble.tidyverse.org/) .pull-left80[ A tibble (class `tbl_df`) is "a modern reimagining" of the data frame. Advantages over traditional data frames include: - improved print method that shows... - ...only the first 10 rows - ...all the columns that fit on screen + names of the remaining ones - ...column types - more consistent subsetting - less type coercion - prohibits partial matching ] .pull-right20[ <img src="figures//03-tibble.png" width="100%" /> ] .font80[ .content-box-green[ Use `as_tibble()` to convert a data frame to a tibble. ] ] ??? - iris[, c(1,2)] returns data frame, but iris[, 1] returns a vector --- ## Tibble vs. data frame .panelset[ .panel[.panel-name[Print `data.frame`] ```r class(iris) ``` ``` ## [1] "data.frame" ``` ```r iris ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 12 4.8 3.4 1.6 0.2 setosa ## 13 4.8 3.0 1.4 0.1 setosa ## 14 4.3 3.0 1.1 0.1 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa ## 18 5.1 3.5 1.4 0.3 setosa ## 19 5.7 3.8 1.7 0.3 setosa ## 20 5.1 3.8 1.5 0.3 setosa ## 21 5.4 3.4 1.7 0.2 setosa ## 22 5.1 3.7 1.5 0.4 setosa ## 23 4.6 3.6 1.0 0.2 setosa ## 24 5.1 3.3 1.7 0.5 setosa ## 25 4.8 3.4 1.9 0.2 setosa ## 26 5.0 3.0 1.6 0.2 setosa ## 27 5.0 3.4 1.6 0.4 setosa ## 28 5.2 3.5 1.5 0.2 setosa ## 29 5.2 3.4 1.4 0.2 setosa ## 30 4.7 3.2 1.6 0.2 setosa ## 31 4.8 3.1 1.6 0.2 setosa ## 32 5.4 3.4 1.5 0.4 setosa ## 33 5.2 4.1 1.5 0.1 setosa ## 34 5.5 4.2 1.4 0.2 setosa ## 35 4.9 3.1 1.5 0.2 setosa ## 36 5.0 3.2 1.2 0.2 setosa ## 37 5.5 3.5 1.3 0.2 setosa ## 38 4.9 3.6 1.4 0.1 setosa ## 39 4.4 3.0 1.3 0.2 setosa ## 40 5.1 3.4 1.5 0.2 setosa ## 41 5.0 3.5 1.3 0.3 setosa ## 42 4.5 2.3 1.3 0.3 setosa ## 43 4.4 3.2 1.3 0.2 setosa ## 44 5.0 3.5 1.6 0.6 setosa ## 45 5.1 3.8 1.9 0.4 setosa ## 46 4.8 3.0 1.4 0.3 setosa ## 47 5.1 3.8 1.6 0.2 setosa ## 48 4.6 3.2 1.4 0.2 setosa ## 49 5.3 3.7 1.5 0.2 setosa ## 50 5.0 3.3 1.4 0.2 setosa ## 51 7.0 3.2 4.7 1.4 versicolor ## 52 6.4 3.2 4.5 1.5 versicolor ## 53 6.9 3.1 4.9 1.5 versicolor ## 54 5.5 2.3 4.0 1.3 versicolor ## 55 6.5 2.8 4.6 1.5 versicolor ## 56 5.7 2.8 4.5 1.3 versicolor ## 57 6.3 3.3 4.7 1.6 versicolor ## 58 4.9 2.4 3.3 1.0 versicolor ## 59 6.6 2.9 4.6 1.3 versicolor ## 60 5.2 2.7 3.9 1.4 versicolor ## 61 5.0 2.0 3.5 1.0 versicolor ## 62 5.9 3.0 4.2 1.5 versicolor ## 63 6.0 2.2 4.0 1.0 versicolor ## 64 6.1 2.9 4.7 1.4 versicolor ## 65 5.6 2.9 3.6 1.3 versicolor ## 66 6.7 3.1 4.4 1.4 versicolor ## 67 5.6 3.0 4.5 1.5 versicolor ## 68 5.8 2.7 4.1 1.0 versicolor ## 69 6.2 2.2 4.5 1.5 versicolor ## 70 5.6 2.5 3.9 1.1 versicolor ## 71 5.9 3.2 4.8 1.8 versicolor ## 72 6.1 2.8 4.0 1.3 versicolor ## 73 6.3 2.5 4.9 1.5 versicolor ## 74 6.1 2.8 4.7 1.2 versicolor ## 75 6.4 2.9 4.3 1.3 versicolor ## 76 6.6 3.0 4.4 1.4 versicolor ## 77 6.8 2.8 4.8 1.4 versicolor ## 78 6.7 3.0 5.0 1.7 versicolor ## 79 6.0 2.9 4.5 1.5 versicolor ## 80 5.7 2.6 3.5 1.0 versicolor ## 81 5.5 2.4 3.8 1.1 versicolor ## 82 5.5 2.4 3.7 1.0 versicolor ## 83 5.8 2.7 3.9 1.2 versicolor ## 84 6.0 2.7 5.1 1.6 versicolor ## 85 5.4 3.0 4.5 1.5 versicolor ## 86 6.0 3.4 4.5 1.6 versicolor ## 87 6.7 3.1 4.7 1.5 versicolor ## 88 6.3 2.3 4.4 1.3 versicolor ## 89 5.6 3.0 4.1 1.3 versicolor ## 90 5.5 2.5 4.0 1.3 versicolor ## 91 5.5 2.6 4.4 1.2 versicolor ## 92 6.1 3.0 4.6 1.4 versicolor ## 93 5.8 2.6 4.0 1.2 versicolor ## 94 5.0 2.3 3.3 1.0 versicolor ## 95 5.6 2.7 4.2 1.3 versicolor ## 96 5.7 3.0 4.2 1.2 versicolor ## 97 5.7 2.9 4.2 1.3 versicolor ## 98 6.2 2.9 4.3 1.3 versicolor ## 99 5.1 2.5 3.0 1.1 versicolor ## 100 5.7 2.8 4.1 1.3 versicolor ## 101 6.3 3.3 6.0 2.5 virginica ## 102 5.8 2.7 5.1 1.9 virginica ## 103 7.1 3.0 5.9 2.1 virginica ## 104 6.3 2.9 5.6 1.8 virginica ## 105 6.5 3.0 5.8 2.2 virginica ## 106 7.6 3.0 6.6 2.1 virginica ## 107 4.9 2.5 4.5 1.7 virginica ## 108 7.3 2.9 6.3 1.8 virginica ## 109 6.7 2.5 5.8 1.8 virginica ## 110 7.2 3.6 6.1 2.5 virginica ## 111 6.5 3.2 5.1 2.0 virginica ## 112 6.4 2.7 5.3 1.9 virginica ## 113 6.8 3.0 5.5 2.1 virginica ## 114 5.7 2.5 5.0 2.0 virginica ## 115 5.8 2.8 5.1 2.4 virginica ## 116 6.4 3.2 5.3 2.3 virginica ## 117 6.5 3.0 5.5 1.8 virginica ## 118 7.7 3.8 6.7 2.2 virginica ## 119 7.7 2.6 6.9 2.3 virginica ## 120 6.0 2.2 5.0 1.5 virginica ## 121 6.9 3.2 5.7 2.3 virginica ## 122 5.6 2.8 4.9 2.0 virginica ## 123 7.7 2.8 6.7 2.0 virginica ## 124 6.3 2.7 4.9 1.8 virginica ## 125 6.7 3.3 5.7 2.1 virginica ## 126 7.2 3.2 6.0 1.8 virginica ## 127 6.2 2.8 4.8 1.8 virginica ## 128 6.1 3.0 4.9 1.8 virginica ## 129 6.4 2.8 5.6 2.1 virginica ## 130 7.2 3.0 5.8 1.6 virginica ## 131 7.4 2.8 6.1 1.9 virginica ## 132 7.9 3.8 6.4 2.0 virginica ## 133 6.4 2.8 5.6 2.2 virginica ## 134 6.3 2.8 5.1 1.5 virginica ## 135 6.1 2.6 5.6 1.4 virginica ## 136 7.7 3.0 6.1 2.3 virginica ## 137 6.3 3.4 5.6 2.4 virginica ## 138 6.4 3.1 5.5 1.8 virginica ## 139 6.0 3.0 4.8 1.8 virginica ## 140 6.9 3.1 5.4 2.1 virginica ## 141 6.7 3.1 5.6 2.4 virginica ## 142 6.9 3.1 5.1 2.3 virginica ## 143 5.8 2.7 5.1 1.9 virginica ## 144 6.8 3.2 5.9 2.3 virginica ## 145 6.7 3.3 5.7 2.5 virginica ## 146 6.7 3.0 5.2 2.3 virginica ## 147 6.3 2.5 5.0 1.9 virginica ## 148 6.5 3.0 5.2 2.0 virginica ## 149 6.2 3.4 5.4 2.3 virginica ## 150 5.9 3.0 5.1 1.8 virginica ``` ] .panel[.panel-name[Print `tbl_df`] ```r as_tibble(iris) ``` ``` ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fct> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ... with 140 more rows ``` ] .panel[.panel-name[Line width] Tibbles adjust to the available line width! ```r as_tibble(iris) ``` ``` ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length ## <dbl> <dbl> <dbl> ## 1 5.1 3.5 1.4 ## 2 4.9 3 1.4 ## 3 4.7 3.2 1.3 ## 4 4.6 3.1 1.5 ## 5 5 3.6 1.4 ## 6 5.4 3.9 1.7 ## 7 4.6 3.4 1.4 ## 8 5 3.4 1.5 ## 9 4.4 2.9 1.4 ## 10 4.9 3.1 1.5 ## # ... with 140 more rows, and 2 more variables: ## # Petal.Width <dbl>, Species <fct> ``` ] .panel[.panel-name[Build function] ```r df <- data.frame( `bad name` = 1:3, # syntactically invalid name because of SPACE x = rep(letters[1:2], length.out = 3) ) str(df) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ bad.name: int 1 2 3 ## $ x : chr "a" "b" "a" ``` Tibbles don't coerce character vectors to factors. ```r ti <- tibble( `bad name` = 1:3, # no "auto repair" of invalid column name x = rep(letters[1:2], length.out = 3) ) str(ti) ``` ``` ## tibble [3 x 2] (S3: tbl_df/tbl/data.frame) ## $ bad name: int [1:3] 1 2 3 ## $ x : chr [1:3] "a" "b" "a" ``` ] .panel[.panel-name[Rowwise construction: `tribble()`] Tibbles can also be created in a spreadsheet-like rowwise fashion: ```r tribble( ~x, ~y, 1, "b", 2, "a" ) ``` ``` ## # A tibble: 2 x 2 ## x y ## <dbl> <chr> ## 1 1 b ## 2 2 a ``` .content-box-blue[Use the tilde operator (`~`) to signify column names.] ] ] --- exclude: true name: purrr ## The `purrr` package 😻 The `purrr` package provides a family of functions for applying some function to each element of a vector and saving the result. .pull-left70[ `map*()` takes as arguments 1) an atomic vector or a list and 2) a function. It return a new vector of the **same length** as the input. The **type** of the vector is specified by the **suffix** of the `map*()` function. - `map()` returns a list - `map_lgl()` returns a logical vector - `map_int()` returns an integer vector - `map_dbl()` returns a double vector - `map_chr()` returns a character vector Function reference: RStudio's [`purrr` cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) ] .pull-right30[ <img src="figures//03-purrr.png" width="100%" /> ] .footnote[ The `*apply()` functions in base `R` (e.g. `apply()`, `lapply()`, `sapply()`, ...) have similar purposes, however they are inconsistent in their output and hence not recommended. ] ??? - problem with loops: you iterate over something, but it doesnt clearly convey a high level goal --- exclude: true class: middle, exercise-blue ## Your turn 8
03
:
00
Actually, we don't need to store the property facilities as tibble, since the column `facility_nr` is redundant. We change the `facilities` column to a list column, such that each element is a character vector of property facilities. ```r properties_l <- properties %>% mutate(facilities = strsplit(facilities, ",")) head(properties_l, 3) ``` ``` ## # A tibble: 3 x 5 ## property_id destination property_type nr_rooms facilities ## <dbl> <chr> <chr> <dbl> <list> ## 1 2668 Brisbane Hotel 32 <chr [6]> ## 2 4656 Brisbane Hotel 39 <chr [7]> ## 3 4563 Brisbane Apartment 9 <chr [1]> ``` Example: ```r properties_l$facilities[[1]] ``` ``` ## [1] "airport shuttle" "free wifi" "garden" "breakfast" ## [5] "pool" "on-site restaurant" ``` Add a column `n_features` to `properties` that contains the number of facilities. Hint: Use the `map_*()` function that returns an integer vector. --- exclude: true ## Iteration over list columns Perform a statistical test for each day of the week: _is there a difference in ratings between business travelers vs. non-business travelers?_ ```r bookings %>% group_by(checkin_day) %>% summarise(t_test = t.test(review_score ~ for_business)) ``` ``` ## Error: Problem with `summarise()` input `t_test`. ## x Input `t_test` must be a vector, not a `htest` object. ## i Input `t_test` is `t.test(review_score ~ for_business)`. ## i The error occurred in group 1: checkin_day = "fri". ``` 🤔 _"Why does this code throw an error?"_ The return value of `t.test()` is of type `list`. When using `dplyr` operations like `mutate()`, we must explicitly declare that we want a list column. ```r bookings %>% group_by(checkin_day) %>% * summarize(t_test = list(t.test(review_score ~ for_business))) ``` ``` ## # A tibble: 7 x 2 ## checkin_day t_test ## <chr> <list> ## 1 fri <htest> ## 2 mon <htest> ## 3 sat <htest> ## 4 sun <htest> ## 5 thu <htest> ## 6 tue <htest> ## 7 wed <htest> ``` Extract the `\(p\)`-value from the `t_test` list column. Since we want a numeric column, we use `map_dbl()` to extract a double from a list. ```r bookings %>% group_by(checkin_day) %>% summarize(t_test = list(t.test(review_score ~ for_business))) %>% mutate(p_value = map_dbl(t_test, "p.value")) ``` ``` ## # A tibble: 7 x 3 ## checkin_day t_test p_value ## <chr> <list> <dbl> ## 1 fri <htest> 1.27e-10 ## 2 mon <htest> 3.93e- 9 ## 3 sat <htest> 5.01e- 4 ## 4 sun <htest> 3.25e-16 ## 5 thu <htest> 1.42e-18 ## 6 tue <htest> 6.44e-14 ## 7 wed <htest> 1.65e-15 ``` --- exclude: true class: middle, exercise-blue ## Your turn 9
10
:
00
### _Which factors contribute to the average property review score?_ For **each city**, fit a **linear regression** model<sup>1</sup> to predict a property's **average review score** based on average price per night, number of bookings (stayed or canceled), and property type. Compare the quality of the three models using `\(R^2\)`. Strategy: 1. Compute the summary statistics `avg_review` (average review score), `n_bookings` (number of bookings) and `avg_price` (average price per night) for each property and city. 2. For each city, fit a linear regression model (?`lm`) using the formula `avg_review ~ property_type + n_bookings + avg_price`. Save these models as new column `fit`. 3. Extract the `\(R^2\)` value (`r.squared`) from `fit` using the appropriate `map_*` function. .footnote[ [1] Remove properties with less than 2 non-missing review scores. ] --- exclude: true ## Further materials - Hadley Wickham and Garrett Grolemund. ["R for Data Science"](http://r4ds.had.co.nz/). O'Reilly, 2017. Chapter: - [Data transformation](http://r4ds.had.co.nz/transform.html) - [Tibbles](http://r4ds.had.co.nz/tibbles.html) - [Data import](http://r4ds.had.co.nz/data-import.html) - [Tidy data](http://r4ds.had.co.nz/tidy-data.html) - [Relational data](http://r4ds.had.co.nz/relational-data.html) - [Iteration](http://r4ds.had.co.nz/iteration.html) - RStudio's cheat sheets (see next slides): - [Data Import (`readr + tidyr`)](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf) - [`dplyr`](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) - [`purrr`](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) --- class: middle ## Very useful tidyverse packages we did not cover <img src="figures//03-stringr_forcats_lubridate.png" width="80%" /> For an introduction, see the following chapters from Hadley Wickham and Garrett Grolemund. ["R for Data Science"](http://r4ds.had.co.nz/). O'Reilly, 2017. - [Strings with stringr](https://r4ds.had.co.nz/strings.html) - [Factors with forcats](https://r4ds.had.co.nz/factors.html) - [Dates and Times with lubridate](https://r4ds.had.co.nz/dates-and-times.html) --- ## Session info ``` ## setting value ## version R version 4.0.4 (2021-02-15) ## os Windows 10 x64 ## system x86_64, mingw32 ## ui RTerm ## language EN ## collate English_United States.1252 ## ctype English_United States.1252 ## tz Europe/Berlin ## date 2021-04-13 ``` <div style="font-size:80%;"> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> dplyr </td> <td style="text-align:left;"> 1.0.5 </td> <td style="text-align:left;"> 2021-03-05 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> forcats </td> <td style="text-align:left;"> 0.5.1 </td> <td style="text-align:left;"> 2021-01-27 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> ggplot2 </td> <td style="text-align:left;"> 3.3.3 </td> <td style="text-align:left;"> 2020-12-30 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> purrr </td> <td style="text-align:left;"> 0.3.4 </td> <td style="text-align:left;"> 2020-04-17 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2020-10-05 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> stringr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2019-02-10 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tibble </td> <td style="text-align:left;"> 3.1.0 </td> <td style="text-align:left;"> 2021-02-25 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> tidyr </td> <td style="text-align:left;"> 1.1.3 </td> <td style="text-align:left;"> 2021-03-03 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> tidyverse </td> <td style="text-align:left;"> 1.3.0 </td> <td style="text-align:left;"> 2019-11-21 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> </tbody> </table> ] .pull-right[ ] </div> --- class: last-slide, center, bottom # Thank you! Questions? .courtesy[📷 Photo courtesy of Stefan Berger]