class: title-slide, center, bottom

# 02 - Tips for effective visualizations

## Data Science with R · Summer 2021

### Uli Niemann · Knowledge Management & Discovery Lab

#### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/)

.courtesy[📷 Photo courtesy of Ulrich Arendt]

---

## Keep it simple

.pull-left[

<img src="figures//02-pie.png" width="100%" />

]

.pull-right[

<img src="figures/_gen/02/bar-alternative-1.png" width="425.196850393701" />

]

.footnote[.font80[Slides adapted from: [Introduction to Data Science Course 2020 @ Univ. Edinburgh](https://ids-s1-20.github.io/slides/week-05/w5-d02-effective-dataviz/w5-d02-effective-dataviz.html#1)]]

???

left: created with MS Excel

---

## Use color to draw attention

.pull-left[

<img src="figures/_gen/02/bar-color-1-1.png" width="425.196850393701" />

]

.pull-right[

<img src="figures/_gen/02/bar-color-2-1.png" width="425.196850393701" />

]

---

## Tell a story

.panelset[

.panel[.panel-name[Does the year matter?]

<img src="figures/_gen/02/bundesliga-ftg-color-1.png" width="708.661417322835" />

]

.panel[.panel-name[Plot annotation]

<img src="figures/_gen/02/bundesliga-ftg-label-1.png" width="708.661417322835" />

]

]

---

## Principles for effective visualizations

.font130[

🔢 Order matters

🔄 Put long categories on the y-axis

📐 Keep scales consistent

🎨 Select meaningful colors

🏷️ Use meaningful and nonredundant labels

]

---

## Alphabetical order is rarely ideal

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-2-1.png" width="576" />

]

.panel[.panel-name[Code]

```r
library(gapminder)
l97 <- filter(gapminder, year == 2007, lifeExp > 70)
ggplot(l97, aes(x = continent)) +
  geom_bar()
```

]

]

---

## Order by frequency

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-3-1.png" width="576" />

]

.panel[.panel-name[Code]

`fct_infreq()`: Reorder factor levels by frequency.
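A minimal sketch of what this does to the level order, using a made-up toy factor (`continent_toy` is a hypothetical name, not part of the course data):

```r
library(forcats)

# Toy factor (hypothetical data): Europe occurs 3x, Asia 2x, Africa 1x
continent_toy <- factor(c("Asia", "Europe", "Europe", "Africa", "Europe", "Asia"))

# Default level order is alphabetical
levels(continent_toy)
# "Africa" "Asia" "Europe"

# fct_infreq() puts the most frequent level first
continent_by_freq <- fct_infreq(continent_toy)
levels(continent_by_freq)
# "Europe" "Asia" "Africa"
```

Applied to the plot: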
```r
ggplot(l97, aes(x = fct_infreq(continent))) +
  geom_bar()
```

]

]

---

## Alphabetical order is rarely ideal

.panelset[

.panel[.panel-name[5 Plot]

<img src="figures/_gen/02/poll-bar-1-1.png" width="576" />

<!-- subtitle = '"If there were a federal election next Sunday, which party would you vote for?"', -->

.content-box-yellow[

Since we're using `geom_col()`, we can't use `fct_infreq()`: every category (i.e., party) appears in exactly one observation, so all frequencies are tied.

]

]

.panel[.panel-name[3 Code to prep data]

.pull-left[

```r
umfrage <- read_rds(
  here::here("data", "umfrage.rds")
)
umfrage
```

```
## # A tibble: 56 x 3
##    party   pollster         popularity
##    <chr>   <chr>                 <dbl>
##  1 CDU/CSU Allensbach             28.5
##  2 CDU/CSU Kantar(Emnid)          25
##  3 CDU/CSU Forsa                  26
##  4 CDU/CSU Forsch’gr.Wahlen       28
##  5 CDU/CSU GMS                    37
##  6 CDU/CSU Infratestdimap         29
##  7 CDU/CSU INSA                   28
##  8 CDU/CSU Yougov                 33
##  9 SPD     Allensbach             18
## 10 SPD     Kantar(Emnid)          17
## # ... with 46 more rows
```

```r
(date_range <- attr(umfrage, "date_range"))
```

```
## [1] "2021-02-15" "2021-03-27"
```

]

.pull-right[

```r
(date_range_chr <- paste0(
  date_range, collapse = " - "))
```

```
## [1] "2021-02-15 - 2021-03-27"
```

```r
umfrage_avg <- umfrage %>%
  group_by(party) %>%
  summarize(popularity = mean(popularity)) %>%
  ungroup()
umfrage_avg
```

```
## # A tibble: 7 x 2
##   party     popularity
##   <chr>          <dbl>
## 1 AfD            10.5
## 2 CDU/CSU        29.3
## 3 DIE LINKE       7.69
## 4 FDP             9
## 5 GRÜNE          20.7
## 6 Other           6.06
## 7 SPD            16.8
```

]

]

.panel[.panel-name[4 Code for plot]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col() +
  labs(
    x = NULL, y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```

]

.panel[.panel-name[1 Poll aggregator website]
<https://www.wahlrecht.de/umfragen/>

<img src="figures//02-wahlrecht-umfrage.png" width="80%" />

]

.panel[.panel-name[2 Code to scrape data]

```r
library(rvest)
# Scrape the poll table from the aggregator website
umfrage <- read_html("https://www.wahlrecht.de/umfragen/") %>%
  html_node(".wilko") %>%
  html_table()
umfrage[names(umfrage) == ""] <- NULL # drop unnamed columns
umfrage[length(umfrage)] <- NULL # drop last column "Letzte Bundestagswahl" (last federal election)
# Parse the first row as day-month-year dates and keep the earliest and latest
date_range <- as.character(range(lubridate::dmy(as.character(umfrage[1, ][-1]))))
umfrage <- umfrage %>%
  filter(Institut %in% c("CDU/CSU", "SPD", "GRÜNE", "FDP", "DIE LINKE", "AfD", "Sonstige")) %>%
  rename(party = Institut) %>%
  pivot_longer(cols = -party, names_to = "pollster", values_to = "popularity") %>%
  mutate(popularity = str_replace(popularity, ",", "\\.")) %>% # decimal comma -> decimal point
  mutate(popularity = str_remove(popularity, " %")) %>%
  mutate(popularity = as.double(popularity)) %>%
  mutate(party = ifelse(party == "Sonstige", "Other", party)) # translate "Sonstige"
attr(umfrage, "date_range") <- date_range
write_rds(umfrage, here::here("data", "umfrage.rds"))
```

]

]

---

## Order by a second variable

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-4-1.png" width="576" />

]

.panel[.panel-name[Code]

`fct_reorder()`: Reorder factor levels by the values of another (numeric) variable. Use `-` to sort in descending order.

```r
ggplot(
  umfrage_avg,
  aes(
*   x = fct_reorder(party, -popularity),
    y = popularity
  )
) +
  geom_col() +
  labs(
    x = NULL, y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```

]

]

---

## Custom order

.content-box-gray[

Election polls sometimes show the parties in the order of their vote shares in the previous election. For example, in the 2017 federal election the SPD received the second-most votes, whereas GRÜNE came only sixth.
]

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-5-1.png" width="576" />

]

.panel[.panel-name[Code]

`fct_relevel()`: Manually reorder factor levels.

```r
umfrage_avg <- umfrage_avg %>%
  mutate(
*   party = fct_relevel(party,
*     "CDU/CSU", "SPD", "AfD", "FDP", "DIE LINKE", "GRÜNE", "Other"
*   )
  )
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col() +
  labs(
    x = NULL, y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```

]

]

---

## Factor levels often need to be cleaned up

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-6-1.png" width="360" />

]

.panel[.panel-name[Code]

```r
ggplot(attrition, aes(x = BusinessTravel)) +
  geom_bar()
```

]

]

???

- remove "Travel" from factor labels

---

## Clean up labels

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/poll-bar-7-1.png" width="360" />

]

.panel[.panel-name[Code]

`fct_recode()`: Manually relabel factor levels.
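A minimal sketch on a made-up toy factor (`travel_toy` is hypothetical, not the `attrition` data), showing that only the labels change while the order of the levels stays the same:

```r
library(forcats)

# Toy factor (hypothetical data) with the same kind of labels
travel_toy <- factor(c("Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"))

# Syntax: "new label" = "old label"
travel_clean <- fct_recode(
  travel_toy,
  "Frequently" = "Travel_Frequently",
  "Rarely" = "Travel_Rarely",
  "Non" = "Non-Travel"
)

levels(travel_toy)
# "Non-Travel" "Travel_Frequently" "Travel_Rarely"
levels(travel_clean)
# "Non" "Frequently" "Rarely"
```

Applied to the `attrition` data: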
```r
attrition <- attrition %>%
  mutate(
    BusinessTravel = fct_recode(
      BusinessTravel,
      "Frequently" = "Travel_Frequently",
      "Rarely" = "Travel_Rarely",
      "Non" = "Non-Travel"
    )
  )
ggplot(attrition, aes(x = BusinessTravel)) +
  geom_bar()
```

]

]

---

## Put long and overlapping categories on the y-axis

.pull-left[

Categories on x-axis:

```r
ggplot(
  umfrage_avg,
  aes(x = party, y = popularity)
) +
  geom_col()
```

<img src="figures/_gen/02/long-cats-y-1-1.png" width="360" />

]

.pull-right[

Categories on y-axis:

```r
ggplot(
  umfrage_avg,
  aes(x = popularity, y = party)
) +
  geom_col()
```

<img src="figures/_gen/02/long-cats-y-2-1.png" width="360" />

]

---

## Reverse the order of levels

.panelset[

.panel[.panel-name[`fct_rev()`]

`fct_rev()`: Reverse the order of factor levels.

```r
ggplot(umfrage_avg, aes(x = popularity,
*                       y = fct_rev(party))) +
  geom_col()
```

<img src="figures/_gen/02/long-cats-y-3-1.png" width="360" />

]

.panel[.panel-name[Via scale setting]

`rev()`: Reverse the order of values (works for any vector type). Passed as a function to `limits`, it reverses the order of the categories on the axis.

```r
ggplot(umfrage_avg, aes(x = popularity, y = party)) +
  geom_col() +
* scale_y_discrete(limits = rev)
```

<img src="figures/_gen/02/long-cats-y-4-1.png" width="360" />

]

]

---

## Before plotting, think about the purpose

.content-box-blue[

**Example:** What are the number and the share of women in each education field of the attrition data?
]

.panelset[

.panel[.panel-name[Stacked bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar() +
  theme(legend.position = "bottom")
```

<img src="figures/_gen/02/purpose-1-1.png" width="566.929133858268" />

]

.panel[.panel-name[Filled bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar(position = "fill") +
  theme(legend.position = "bottom")
```

<img src="figures/_gen/02/purpose-2-1.png" width="566.929133858268" />

]

.panel[.panel-name[Dodged bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom")
```

<img src="figures/_gen/02/purpose-3-1.png" width="566.929133858268" />

]

.panel[.panel-name[Faceted bars]

```r
ggplot(attrition, aes(y = Gender, fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
  theme(legend.position = "bottom")
```

<img src="figures/_gen/02/purpose-4-1.png" width="566.929133858268" />

]

]

???

- Because the totals differ between the fields, it is hard to see, e.g., the percentage of women WITHIN the Human Resources category.
- default: `position_stack`
- try: `position_fill`

---

## Avoid redundancy

.panelset[

.panel[.panel-name[High redundancy]

.font130[🚫 DON'T]

```r
ggplot(attrition, aes(y = Gender, fill = Gender, color = Gender, linetype = Gender, alpha = Gender)) +
  geom_bar(size = 2) +
  facet_wrap(~ EducationField) +
  scale_color_brewer(palette = "Set1")
```

```
## Warning: Using alpha for a discrete variable is not advised.
```

<img src="figures/_gen/02/purpose-5-1.png" width="680.314960629921" />

]

.panel[.panel-name[Low redundancy]

<!-- ✅ DO -->

```r
ggplot(attrition, aes(y = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField)
```

<img src="figures/_gen/02/purpose-6-1.png" width="566.929133858268" />

]

]

---

## Keep scales consistent

.font130[🚫 DON'T]

```r
ggplot(attrition, aes(y = Gender)) +
  geom_bar() +
* facet_wrap(~ EducationField, scales = "free_x")
```

<img src="figures/_gen/02/consistent-scales-1-1.png" width="566.929133858268" />

---

## Use meaningful and nonredundant labels

.panelset[

.panel[.panel-name[Without context]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col()
```

<img src="figures/_gen/02/meaningful-labels-1-1.png" width="708.661417322835" />

]

.panel[.panel-name[With context]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col() +
  labs(x = NULL, y = "Popularity (%)",
       title = "German parliament election poll",
       subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
       caption = "Data source: https://www.wahlrecht.de/umfragen/")
```

<img src="figures/_gen/02/meaningful-labels-2-1.png" width="708.661417322835" />

]

]

---

## Select meaningful colors

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/redundancy-1-1.png" width="708.661417322835" />

]

.panel[.panel-name[Code]

```r
umfrage_avg <- umfrage_avg %>%
  mutate(party = fct_reorder(party, -popularity))
ggplot(umfrage_avg, aes(x = party, y = popularity, fill = party)) +
  geom_col() +
  labs(x = NULL, y = "Popularity (%)",
       title = "German parliament election poll",
       subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
       caption = "Data source: https://www.wahlrecht.de/umfragen/") +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic")) +
* scale_fill_manual(values = c("CDU/CSU" = "#000000", "GRÜNE" = "#1FAF12", "SPD" = "#E30013",
*   "AfD" = "#009DE0", "DIE LINKE" = "#DF007D", "FDP" = "#FFED00", "Other" = "gray80"))
```

]

]

---

## Be selective with redundancy

.panelset[

.panel[.panel-name[Plot]

<img src="figures/_gen/02/redundancy-2-1.png" width="595.275590551181" />

]

.panel[.panel-name[Code]

```r
umfrage_avg <- umfrage_avg %>%
  mutate(party = fct_reorder(party, -popularity))
ggplot(umfrage_avg, aes(x = party, y = popularity, fill = party)) +
  geom_col() +
  labs(x = NULL, y = "Popularity (%)",
       title = "German parliament election poll",
       subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
       caption = "Data source: https://www.wahlrecht.de/umfragen/") +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic")) +
  scale_fill_manual(values = c("CDU/CSU" = "#000000", "GRÜNE" = "#1FAF12", "SPD" = "#E30013",
                               "AfD" = "#009DE0", "DIE LINKE" = "#DF007D", "FDP" = "#FFED00", "Other" = "gray80")) +
  guides(fill = "none") # drop the legend: the x-axis already names each party
```

]

]

---

## Select meaningful colors

.panelset[

.panel[.panel-name[No color]

<img src="figures/_gen/02/unnamed-chunk-13-1.png" width="566.929133858268" />

]

.panel[.panel-name[ColorBrewer website]

<iframe src="https://colorbrewer2.org" width="100%" height="500px"></iframe>

]

.panel[.panel-name[Manual colors]

```r
ggplot(attrition, aes(y = fct_rev(Gender), fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
* scale_fill_manual(values = c("Female" = "#7fc97f", "Male" = "#fdc086")) +
  guides(fill = "none")
```

<img src="figures/_gen/02/purpose-7-1.png" width="566.929133858268" />

]

.panel[.panel-name[RColorBrewer package]

```r
RColorBrewer::display.brewer.all()
```

<img src="figures//02-colorbrewer.png" width="45%" />

]

.panel[.panel-name[Palette]

```r
ggplot(attrition, aes(y = fct_rev(Gender), fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
* scale_fill_brewer(palette = "Pastel2") +
  guides(fill = "none")
```

<img src="figures/_gen/02/purpose-8-1.png" width="566.929133858268" />

]

]

---

## 
Session info ``` ## setting value ## version R version 4.0.4 (2021-02-15) ## os Windows 10 x64 ## system x86_64, mingw32 ## ui RTerm ## language (EN) ## collate English_United States.1252 ## ctype English_United States.1252 ## tz Europe/Berlin ## date 2021-03-29 ``` <div style="font-size:80%;"> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> dplyr </td> <td style="text-align:left;"> 1.0.5 </td> <td style="text-align:left;"> 2021-03-05 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> forcats </td> <td style="text-align:left;"> 0.5.1 </td> <td style="text-align:left;"> 2021-01-27 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> gapminder </td> <td style="text-align:left;"> 0.3.0 </td> <td style="text-align:left;"> 2017-10-31 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> ggplot2 </td> <td style="text-align:left;"> 3.3.3 </td> <td style="text-align:left;"> 2020-12-30 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> purrr </td> <td style="text-align:left;"> 0.3.4 </td> <td style="text-align:left;"> 2020-04-17 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> </tbody> </table> ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2020-10-05 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> stringr </td> <td 
style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2019-02-10 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tibble </td> <td style="text-align:left;"> 3.1.0 </td> <td style="text-align:left;"> 2021-02-25 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> tidyr </td> <td style="text-align:left;"> 1.1.3 </td> <td style="text-align:left;"> 2021-03-03 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> tidyverse </td> <td style="text-align:left;"> 1.3.0 </td> <td style="text-align:left;"> 2019-11-21 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> </tbody> </table> ] </div> --- class: last-slide, center, bottom # Thank you! Questions? .courtesy[📷 Photo courtesy of Stefan Berger]