Tips for effective visualizations

class: title-slide, center, bottom

# 02 - Tips for effective visualizations

## Data Science with R &#183; Summer 2021

### Uli Niemann &#183; Knowledge Management & Discovery Lab

#### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/)

.courtesy[&#x1F4F7; Photo courtesy of Ulrich Arendt]

---

## Keep it simple

.pull-left[

]

.pull-right[

]

.footnote[.font80[Slides adapted from: [Introduction to Data Science Course 2020 @ Univ. Edinburgh](https://ids-s1-20.github.io/slides/week-05/w5-d02-effective-dataviz/w5-d02-effective-dataviz.html#1)]]

???

left: created with MS Excel

---

## Use color to draw attention

.pull-left[

]

.pull-right[

]

---

## Tell a story

.panelset[

.panel[.panel-name[Does the year matter?]

]
.panel[.panel-name[Plot annotation]

]

---

## Principles for effective visualizations

.font130[

&#x1F522; Order matters

&#x1F504; Put long categories on the y-axis

&#x1F4D0; Keep scales consistent

&#x1F3A8; Select meaningful colors

&#x1F3F7;&#xFE0F; Use meaningful and nonredundant labels

]

---

## Alphabetical order is rarely ideal

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

```r
library(tidyverse) # provides filter(), ggplot(), and the fct_*() helpers
library(gapminder)
l97 <- filter(gapminder, year == 2007, lifeExp > 70)

ggplot(l97, aes(x = continent)) +
  geom_bar() 
```

]

---

## Order by frequency

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

`fct_infreq()`: Reorder factor levels by frequency.

```r
ggplot(l97, aes(x = fct_infreq(continent))) +
  geom_bar() 
```
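To see what `fct_infreq()` does to the levels themselves, here is a minimal sketch on a made-up vector (the values in `x` are illustrative and not taken from gapminder):

```r
library(forcats)

# Toy data (hypothetical): "b" occurs three times, "a" twice, "c" once
x <- factor(c("b", "a", "b", "c", "b", "a"))

levels(x)                 # default order: alphabetical

x_freq <- fct_infreq(x)
levels(x_freq)            # most frequent level first
```

Only the level order changes; the values stored in the factor stay the same.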

]

---

## Alphabetical order is rarely ideal

.panelset[
.panel[.panel-name[5 Plot]

.content-box-yellow[

Since we're using `geom_col()`, `fct_infreq()` does not help here: every category (i.e. party) appears in exactly one observation, so all frequencies are equal.

]
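A quick way to check this on aggregated data is to count the rows per category. The tibble `avg` below is a hypothetical stand-in for `umfrage_avg` with only three parties:

```r
library(dplyr)

# Hypothetical stand-in for umfrage_avg: one row per party
avg <- tibble(
  party = c("AfD", "CDU/CSU", "SPD"),
  popularity = c(10.5, 29.3, 16.8)
)

# Every party occurs exactly once, so counts carry no ordering information
party_counts <- count(avg, party)
party_counts

all(party_counts$n == 1)
```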

.panel[.panel-name[3 Code to prep data]

.pull-left[

```r
umfrage <- read_rds(
  here::here("data", "umfrage.rds")
)
umfrage
```

```
## # A tibble: 56 x 3
##    party   pollster         popularity
##    <chr>   <chr>                 <dbl>
##  1 CDU/CSU Allensbach             28.5
##  2 CDU/CSU Kantar(Emnid)          25  
##  3 CDU/CSU Forsa                  26  
##  4 CDU/CSU Forsch’gr.Wahlen       28  
##  5 CDU/CSU GMS                    37  
##  6 CDU/CSU Infratestdimap         29  
##  7 CDU/CSU INSA                   28  
##  8 CDU/CSU Yougov                 33  
##  9 SPD     Allensbach             18  
## 10 SPD     Kantar(Emnid)          17  
## # ... with 46 more rows
```

```r
(date_range <- attr(umfrage, "date_range"))
```

```
## [1] "2021-02-15" "2021-03-27"
```

]

.pull-right[

```r
(date_range_chr <- paste0(
  date_range, collapse = " - "))
```

```
## [1] "2021-02-15 - 2021-03-27"
```

```r
umfrage_avg <- umfrage %>%
  group_by(party) %>%
  summarize(popularity = mean(popularity)) %>%
  ungroup()
umfrage_avg
```

```
## # A tibble: 7 x 2
##   party     popularity
##   <chr>          <dbl>
## 1 AfD            10.5 
## 2 CDU/CSU        29.3 
## 3 DIE LINKE       7.69
## 4 FDP             9   
## 5 GRÜNE          20.7 
## 6 Other           6.06
## 7 SPD            16.8
```

]

.panel[.panel-name[4 Code for plot]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col() +
  labs(
    x = NULL,
    y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```

]

.panel[.panel-name[1 Poll aggregator website]

<https://www.wahlrecht.de/umfragen/>

]
.panel[.panel-name[2 Code to scrape data]

```r
library(tidyverse) # filter(), pivot_longer(), str_replace(), write_rds()
library(rvest)
umfrage <- read_html("https://www.wahlrecht.de/umfragen/") %>%
  html_node(".wilko") %>%
  html_table()
umfrage[names(umfrage) == ""] <- NULL
umfrage[length(umfrage)] <- NULL # "Letzte Bundestagswahl"

date_range <- as.character(range(lubridate::dmy(as.character(umfrage[1, ][-1]))))

umfrage <- umfrage %>%
  filter(Institut %in% c("CDU/CSU", "SPD", "GRÜNE", "FDP", "DIE LINKE", "AfD", "Sonstige")) %>%
  rename(party = Institut) %>%
  pivot_longer(cols = -party, names_to = "pollster", values_to = "popularity") %>%
  mutate(popularity = str_replace(popularity, ",", "\\.")) %>%
  mutate(popularity = str_remove(popularity, " %")) %>%
  mutate(popularity = as.double(popularity)) %>%
  mutate(party = ifelse(party == "Sonstige", "Other", party))

attr(umfrage, "date_range") <- date_range

write_rds(umfrage, here::here("data", "umfrage.rds"))
```
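As a possible alternative to the three `mutate()` steps that clean `popularity`, `readr::parse_number()` can read numbers with a decimal comma in one call. This is only a sketch; the strings in `raw` are assumed examples of the scraped format, not actual values from the website:

```r
library(readr)

# Illustrative strings in the format the scraped table is assumed to use
raw <- c("28,5 %", "25 %", "17 %")

# parse_number() ignores the non-numeric suffix;
# the locale declares "," as the decimal mark
parsed <- parse_number(raw, locale = locale(decimal_mark = ","))
parsed
```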

]

---

## Order by a second variable

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

`fct_reorder()`: Reorder factor levels by another numeric variable. Use `-` to sort in descending order.

```r
ggplot(
  umfrage_avg, 
  aes(
*   x = fct_reorder(party, -popularity),
    y = popularity
  )
) +
  geom_col() +
  labs(
    x = NULL,
    y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```
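The effect on the level order can be checked without plotting. A minimal sketch with made-up values (`party` and `share` are hypothetical); besides the minus sign, `fct_reorder()` also has a `.desc` argument:

```r
library(forcats)

# Toy data (hypothetical values)
party <- factor(c("A", "B", "C"))
share <- c(20, 30, 10)

asc   <- levels(fct_reorder(party, share))                # ascending by share
desc1 <- levels(fct_reorder(party, -share))               # descending via minus
desc2 <- levels(fct_reorder(party, share, .desc = TRUE))  # descending via .desc

asc
desc1
desc2
```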

]

---

## Custom order

.content-box-gray[

Election polls sometimes show the parties in the order of their vote shares in the previous election. For example, in the 2017 election the SPD received the second-most votes, whereas GRÜNE came only sixth.

]

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

`fct_relevel()`: Manually reorder factor levels.

```r
umfrage_avg <- umfrage_avg %>%
  mutate(
*   party = fct_relevel(party,
*     "CDU/CSU", "SPD", "AfD", "FDP", "DIE LINKE", "GRÜNE", "Other"
*   )
  )
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col() +
  labs(
    x = NULL,
    y = "Popularity (%)",
    title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/"
  ) +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic"))
```
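You do not have to list every level: `fct_relevel()` moves the named levels and keeps the rest in their existing order. A small sketch on a toy factor (the letters are placeholders):

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Move "c" and "b" to the front; remaining levels keep their order
front <- fct_relevel(f, "c", "b")
levels(front)

# Move "a" to the end with after = Inf
back <- fct_relevel(f, "a", after = Inf)
levels(back)
```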

]

---

## Factor levels often need to be cleaned up

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

```r
ggplot(attrition, aes(x = BusinessTravel)) +
  geom_bar()
```

]

???

- remove "Travel" from factor labels

---

## Clean up labels

.panelset[

.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

`fct_recode()`: Manually relabel factor levels.

```r
attrition <- attrition %>%
  mutate(
    BusinessTravel = fct_recode(
      BusinessTravel,
      "Frequently" = "Travel_Frequently",
      "Rarely" = "Travel_Rarely",
      "Non" = "Non-Travel"
    )
  )

ggplot(attrition, aes(x = BusinessTravel)) +
  geom_bar()
```
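When many levels share the same prefix, relabelling with a function can be less typing than listing each level. A sketch using `fct_relabel()` together with `stringr::str_remove()`; the vector `travel` is a hypothetical stand-in that assumes the level names shown in the code above:

```r
library(forcats)
library(stringr)

# Hypothetical stand-in for attrition$BusinessTravel
travel <- factor(c("Travel_Rarely", "Non-Travel", "Travel_Frequently"))

# Apply a function to every level: drop the leading "Travel_"
travel_clean <- fct_relabel(travel, ~ str_remove(.x, "^Travel_"))
levels(travel_clean)
```

Levels without the prefix (here `"Non-Travel"`) are left unchanged, so they may still need `fct_recode()`.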

]

---

## Put long and overlapping categories on the y-axis

.pull-left[

Categories on x-axis:

```r
ggplot(
  umfrage_avg, 
  aes(x = party, y = popularity)
) +
  geom_col()
```

]

.pull-right[

Categories on y-axis:

```r
ggplot(
  umfrage_avg, 
  aes(x = popularity, y = party)
) +
  geom_col()
```

]

---

## Reverse the order of levels

.panelset[
.panel[.panel-name[`fct_rev()`]

`fct_rev()`: Reverse the order of factor levels

```r
ggplot(umfrage_avg, aes(x = popularity, 
*                       y = fct_rev(party))) +
  geom_col()
```

]
.panel[.panel-name[Via scale setting]

`rev()`: Reverse the order of the elements of any vector. Passed as `limits`, the function is applied to the default limits of the scale.

```r
ggplot(umfrage_avg, aes(x = popularity, y = party)) + 
  geom_col() +
* scale_y_discrete(limits = rev)
```
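Why reverse at all? ggplot2 draws the first level of a discrete y-axis at the bottom, so reversing the levels puts the first category at the top. A minimal sketch of what both approaches do to the order (the three-level factor `party` is a made-up example):

```r
library(forcats)

# Hypothetical factor with an explicit level order
party <- factor(
  c("CDU/CSU", "SPD", "AfD"),
  levels = c("CDU/CSU", "SPD", "AfD")
)

party_rev <- fct_rev(party)
levels(party_rev)

# fct_rev() on the factor gives the same order as rev() on its levels
identical(levels(party_rev), rev(levels(party)))
```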

]

---

## Before plotting, think about the purpose

.content-box-blue[

**Example:** What is the number and share of women for each education field in the attrition data?

]

.panelset[
.panel[.panel-name[Stacked bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar() +
  theme(legend.position = "bottom")
```

]

.panel[.panel-name[Filled bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar(position = "fill") +
  theme(legend.position = "bottom")
```

]

.panel[.panel-name[Dodged bars]

```r
ggplot(attrition, aes(y = EducationField, fill = Gender)) +
  geom_bar(position = "dodge") +
  theme(legend.position = "bottom")
```

]

.panel[.panel-name[Faceted bars]

```r
ggplot(attrition, aes(y = Gender, fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
  theme(legend.position = "bottom")
```

]
]
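It can help to compute the numbers behind the question before choosing a bar layout. A sketch with dplyr on a small made-up data frame (`toy` is a hypothetical stand-in for `attrition`, reduced to the two relevant columns):

```r
library(dplyr)

# Hypothetical stand-in for attrition
toy <- tibble(
  EducationField = c("HR", "HR", "HR",
                     "Medical", "Medical", "Medical", "Medical"),
  Gender = c("Female", "Male", "Male",
             "Female", "Female", "Female", "Male")
)

gender_share <- toy %>%
  count(EducationField, Gender) %>%   # number per field and gender
  group_by(EducationField) %>%
  mutate(share = n / sum(n)) %>%      # share within each field
  ungroup()

gender_share
```

The column `n` corresponds to what stacked and dodged bars display, and `share` to what filled bars display.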

???

- because the totals differ between the fields, it is hard to see, for example, the percentage of women WITHIN the human resources category.
- default: position_stack
- try: position_fill

---

## Avoid redundancy

.panelset[
.panel[.panel-name[High redundancy]

.font130[&#128683; DON'T]

```r
ggplot(attrition, 
  aes(y = Gender, fill = Gender, color = Gender, linetype = Gender, alpha = Gender)) +
  geom_bar(size = 2) +
  facet_wrap(~ EducationField) +
  scale_color_brewer(palette = "Set1")
```

```
## Warning: Using alpha for a discrete variable is not advised.
```

]

.panel[.panel-name[Low redundancy]

```r
ggplot(attrition, aes(y = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField)
```

]

---

## Keep scales consistent

.font130[&#128683; DON'T]

```r
ggplot(attrition, aes(y = Gender)) +
  geom_bar() +
* facet_wrap(~ EducationField, scales = "free_x")
```
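For comparison, the counterpart simply leaves `scales` at its default of `"fixed"`, so all facets share the same x-axis and bar lengths are comparable across panels. A sketch on a made-up data frame (`toy` is a hypothetical stand-in for `attrition`):

```r
library(ggplot2)

# Hypothetical stand-in for attrition
toy <- data.frame(
  EducationField = c("HR", "HR", "Medical", "Medical", "Medical", "Medical"),
  Gender = c("Female", "Male", "Female", "Female", "Male", "Male")
)

p_fixed <- ggplot(toy, aes(y = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField)  # scales = "fixed" is the default

p_fixed
```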

---

## Use meaningful and nonredundant labels

.panelset[
.panel[.panel-name[Without context]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col()
```

]

.panel[.panel-name[With context]

```r
ggplot(umfrage_avg, aes(x = party, y = popularity)) +
  geom_col()+
  labs(x = NULL, y = "Popularity (%)", title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/")
```

]

---

## Select meaningful colors

.panelset[
.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

```r
umfrage_avg <- umfrage_avg %>% mutate(party = fct_reorder(party, -popularity))
ggplot(umfrage_avg, aes(x = party, y = popularity, fill = party)) +
  geom_col() +
  labs(x = NULL, y = "Popularity (%)", title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/") +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic")) +
* scale_fill_manual(values = c("CDU/CSU" = "#000000", "GRÜNE" = "#1FAF12", "SPD" = "#E30013",
*"AfD" = "#009DE0", "DIE LINKE" = "#DF007D", "FDP" = "#FFED00", "Other" = "gray80"))
```

]

---

## Be selective with redundancy

.panelset[
.panel[.panel-name[Plot]

]

.panel[.panel-name[Code]

```r
umfrage_avg <- umfrage_avg %>% mutate(party = fct_reorder(party, -popularity))
ggplot(umfrage_avg, aes(x = party, y = popularity, fill = party)) +
  geom_col() +
  labs(x = NULL, y = "Popularity (%)", title = "German parliament election poll",
    subtitle = glue::glue("Percentages represent average values across 8 polling institutes\nTime period: {date_range_chr}"),
    caption = "Data source: https://www.wahlrecht.de/umfragen/") +
  theme(plot.subtitle = element_text(size = rel(0.8), face = "italic")) +
  scale_fill_manual(values = c("CDU/CSU" = "#000000", "GRÜNE" = "#1FAF12", "SPD" = "#E30013",
"AfD" = "#009DE0", "DIE LINKE" = "#DF007D", "FDP" = "#FFED00", "Other" = "gray80")) +
  guides(fill = "none") # hide the legend; FALSE is deprecated in newer ggplot2 versions
```

]

---

## Select meaningful colors

.panelset[
.panel[.panel-name[No color]

]

.panel[.panel-name[ColorBrewer website]

]

.panel[.panel-name[Manual colors]

```r
ggplot(attrition, aes(y = fct_rev(Gender), fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
* scale_fill_manual(values = c("Female" = "#7fc97f", "Male" = "#fdc086")) +
  guides(fill = "none")
```

]

.panel[.panel-name[RColorBrewer package]

```r
RColorBrewer::display.brewer.all()
```
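To use individual colors of a palette, for example in `scale_fill_manual()`, `brewer.pal()` returns the hex codes. A minimal sketch; the choice of three colors from `"Pastel2"` is just an example:

```r
library(RColorBrewer)

# Hex codes of the first three colors of a palette (3 is the minimum for n)
pastel <- brewer.pal(n = 3, name = "Pastel2")
pastel
```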

]

.panel[.panel-name[Palette]

```r
ggplot(attrition, aes(y = fct_rev(Gender), fill = Gender)) +
  geom_bar() +
  facet_wrap(~ EducationField) +
* scale_fill_brewer(palette = "Pastel2") +
  guides(fill = "none")
```

]

---

## Session info

```
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       Windows 10 x64              
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  ctype    English_United States.1252  
##  tz       Europe/Berlin               
##  date     2021-03-29
```

.pull-left[

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> package </th>
   <th style="text-align:left;"> version </th>
   <th style="text-align:left;"> date </th>
   <th style="text-align:left;"> source </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> dplyr </td>
   <td style="text-align:left;"> 1.0.5 </td>
   <td style="text-align:left;"> 2021-03-05 </td>
   <td style="text-align:left;"> CRAN (R 4.0.4) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> forcats </td>
   <td style="text-align:left;"> 0.5.1 </td>
   <td style="text-align:left;"> 2021-01-27 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> gapminder </td>
   <td style="text-align:left;"> 0.3.0 </td>
   <td style="text-align:left;"> 2017-10-31 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ggplot2 </td>
   <td style="text-align:left;"> 3.3.3 </td>
   <td style="text-align:left;"> 2020-12-30 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> purrr </td>
   <td style="text-align:left;"> 0.3.4 </td>
   <td style="text-align:left;"> 2020-04-17 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
</tbody>
</table>

]

.pull-right[

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> package </th>
   <th style="text-align:left;"> version </th>
   <th style="text-align:left;"> date </th>
   <th style="text-align:left;"> source </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> readr </td>
   <td style="text-align:left;"> 1.4.0 </td>
   <td style="text-align:left;"> 2020-10-05 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> stringr </td>
   <td style="text-align:left;"> 1.4.0 </td>
   <td style="text-align:left;"> 2019-02-10 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tibble </td>
   <td style="text-align:left;"> 3.1.0 </td>
   <td style="text-align:left;"> 2021-02-25 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tidyr </td>
   <td style="text-align:left;"> 1.1.3 </td>
   <td style="text-align:left;"> 2021-03-03 </td>
   <td style="text-align:left;"> CRAN (R 4.0.4) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tidyverse </td>
   <td style="text-align:left;"> 1.3.0 </td>
   <td style="text-align:left;"> 2019-11-21 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
</tbody>
</table>

]


---

class: last-slide, center, bottom

# Thank you! Questions?

&nbsp;

.courtesy[&#x1F4F7; Photo courtesy of Stefan Berger]