# 09 - Good practices for programming with R

## Data Science with R &#183; Summer 2021

### Uli Niemann &#183; Knowledge Management & Discovery Lab

#### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/)

---

## Outline

- [Miscellaneous tips for reproducible programming](#misc)
- [Reproducible and efficient project pipelines with targets](#drake)
- [Recommendations on naming files & folders](#naming)
- [Package library management with renv](#renv)
- [Debugging](#debugging)

### Recommended reading

&#x1F4DA; Jennifer Bryan, and Jim Hester. [What They Forgot to Teach You About R](https://rstats.wtf/). 2019. Last accessed 10.05.2021.


???

- now you know some of the most useful packages
- however, you would like to improve the efficacy of your programming habits
- miscellaneous tips I wish somebody had told me when I started programming 
with R
- book: compendium of tips, tricks and best practices

- &#x1F4DA; Jonathan McPherson. [Debugging with RStudio](https://support.rstudio.com/hc/en-us/articles/200713843). Last accessed 10.05.2021.

&#x1F4DA; Will Landau, Kirill Müller, Alex Axthelm, Jasper Clarkberg, Lorenz Walthert, Ellis Hughes, and Matthew Mark Strasiotto. 
[The drake R Package User Manual](https://books.ropensci.org/drake/). 2020.

&#x1F4DA; Jennifer Bryan. [Code smells and feels](https://www.youtube.com/watch?v=7oyiPBjLAWY). 
Keynote at the useR! conference 2018.

https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

---

&nbsp;

## Miscellaneous tips for reproducible programming


---

&nbsp;

#### &#x2705; Save code, not the workspace

&xrarr; Save your commands in an R script (`.R` file) or R Markdown 
document (`.Rmd`) instead of saving the results of your interactive 
analysis as a workspace (`.RData`).

&nbsp;

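Tip: Disable automatic saving of the workspace in RStudio (_Tools_ &rarr; _Global Options_ &rarr; _General_: set "Save workspace to .RData on exit" to "Never").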

???

- when you start programming with R, you tend to predominantly use the console
- at the end: save result as workspace -> discouraged
- when you load an .RData file after 3 months, do you remember every single step
you executed to get to this workspace?
- better: save your commands in an R script (`.R` file) or R Markdown 
document (`.Rmd`) instead of saving the results of your interactive 
analysis as a workspace (`.RData`)
- these files may be a bit messy (prototyping) 
- don't need to be perfectly polished

---

&nbsp;

#### &#x2705; Restart R often

- Restart R in RStudio: _Session &rarr; Restart R_  
(keyboard shortcut **Ctrl**+**⇧Shift**+**F10**)
- Restart R from the shell: **Ctrl**+**D** or `q()` to quit current session, 
then restart R
- Restart development where you left off, i.e., re-run all code above cursor position:
  - in an R script: **Ctrl**+**Alt**+**B**
  - in an R Markdown file: **Ctrl**+**Alt**+**P**

&nbsp;

Tip: Restart R every now and then.

???

- to ensure that your code doesn't depend on settings of your current interactive session
- e.g. global options, unusual order of packages in search path, or other objects in workspace

---

#### &#x2705; Avoid using `rm(list = ls())`

Unlike restarting R, the command `rm(list = ls())` .red[**does not**] 
create a new R session.

- &#x2757; It only removes all user-created objects from the environment.
- &#x2757; Previously loaded packages are still in the search path.
- &#x2757; Modifications to global options are still there.

`rm(list = ls())` makes a script **vulnerable to hidden dependencies** 
on commands that were run in the R session before executing `rm(list = ls())`.
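
The three points above can be checked with a minimal sketch that uses only packages shipped with R. It assumes a fresh default session in which the `tools` package is not yet attached; the object name `my_object` is made up for illustration.

```r
my_object <- 1:10        # a user-created object
library(tools)           # attach an additional package
options(scipen = 999)    # modify a global option

rm(list = ls())          # "clean up" without restarting R

exists("my_object", inherits = FALSE)  # FALSE: the object is gone
"package:tools" %in% search()          # TRUE: the package is still attached
getOption("scipen")                    # 999: the option is still modified

options(scipen = 0)      # reset the option to R's default by hand
```

Only a restart of R resets the search path and the global options in one go.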

???

- popular modification: `options(scipen = 999)` to get rid of scientific notation of numbers
- scipen: integer. A penalty to be applied when deciding to print numeric values in fixed or exponential notation.

```r
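library(Hmisc)
library(dplyr)  # attached last, so dplyr::summarize() masks Hmisc::summarize()
summarize(iris$Sepal.Width, iris$Species, mean)  # error: dplyr::summarize() is called

rm(list = ls())  # removes objects only; the search path stays as it is

library(Hmisc)  # no effect: Hmisc is already attached, dplyr still comes first
summarize(iris$Sepal.Width, iris$Species, mean)  # still the same error

# Draw a random sample of 1000 values from a normal distribution with mean 10 billion
hist(rnorm(1000, 1e10))

options(scipen = 999)  # penalize scientific notation when printing numbers
hist(rnorm(1000, 1e10))

rm(list = ls())  # does not reset global options

hist(rnorm(1000, 1e10))  # scipen = 999 is still in effect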
```

---

&#x1F914; _"What about time-consuming scripts that take hours or maybe even 
days? It is impracticable to re-run these scripts every time we need the 
results."_

???

Rerunning an end-to-end analysis is sometimes impractical because parts of the 
computations are very time-consuming.

Split the computationally expensive part of your code 
(e.g. fetching and pre-processing large amounts of raw data from a database) 
into a new script. 
At the end of this script, save the output as an `.rds` file (<u>R</u> 
<u>data</u> <u>single</u>).

`saveRDS(data_prep, "results/data_preprocessed.rds")`
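
A minimal sketch of this caching pattern follows. The aggregation is only a stand-in for an expensive computation, and the file is written to a temporary folder (an assumption of this sketch) rather than to `results/`.

```r
# Stand-in for an expensive computation (hypothetical example)
data_prep <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)

# Written to a temporary folder here; in a project this could be "results/"
rds_file <- file.path(tempdir(), "data_preprocessed.rds")
saveRDS(data_prep, rds_file)

# In a downstream script: load the cached result instead of recomputing it
data_prep_cached <- readRDS(rds_file)
all.equal(data_prep, data_prep_cached)
```

Unlike `save()` and `load()`, `saveRDS()` stores a single object without its name, so the downstream script chooses the name when calling `readRDS()`.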

---

#### &#x2705; Organize code as self-contained "projects":

- Organize each logical project into a separate directory.
- Organize all files related to the project into subfolders, e.g. `data/`, 
`code/`, `figures/`, `reports/`, etc.
- The top-level folder of the project must be clearly recognizable 
(e.g. by containing a .Rproj file)
- Create paths **relative to the top-level folder**.
- Launch the R session from the project’s top-level folder.

.footnote[
&#x1F4DA; Jennifer Bryan. [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/). 2017. Last accessed 10.05.2021.

]

???

- separate directory and subfolder organization: not R-specific, general good practice
- _How can we avoid setwd() at the top of every script?_
- advantage of here: paths always start at the top-level folder of the project -> handy for Rmd reports

---

## RStudio projects

Create a new RStudio project:  
_File_ &rarr; _New Project_ &rarr; _New Directory_ &rarr; _Empty Project_

The working directory corresponds to the top-level folder of the project 
(where the `.Rproj` file is). All project files should be organized in 
subfolders of the project's top-level folder. In the code, the files should 
be referenced with relative paths.

![](figures/09-new_rstudio_project.gif)

&#x1F4DA; Further reading:

- Best practices: [RStudio Projects](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects)
- Chapter [Where does your analysis live?](http://r4ds.had.co.nz/workflow-projects.html#where-does-your-analysis-live) in [R for Data Science](http://r4ds.had.co.nz/)
- [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)

???

- file extension `.Rproj`: text file with project-specific settings
- hidden folder `.Rproj.user`: project-specific temporary files (_better not touch it_)
- open .Rproj: starts R, working directory to top-level folder 
of project, restore previously opened files (pick up where you left off)
- switch to another .Rproj: new R session, working directory to top-level folder 
of project, restore previously opened files (pick up where you left off)
- simultaneously work on 2 projects where each project has its own R session

---

A basic folder structure could look like this:

```
my_project
+-- data
|   +-- processed
|   \-- raw
+-- my_project.Rproj
+-- output
+-- R
+-- README.md
\-- run_analyses.R
```
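
As a sketch, such a skeleton can also be created from R. The project is placed in a temporary folder here (an assumption of this example); the `.Rproj` file is left out because RStudio creates it when the project is set up.

```r
# Location of the new project (temporary folder, for illustration only)
project_root <- file.path(tempdir(), "my_project")

# Subfolders from the structure above
sub_dirs <- c("data/raw", "data/processed", "output", "R")
for (d in sub_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder files at the top level
file.create(file.path(project_root, c("README.md", "run_analyses.R")))

list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```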

???

also helpful: https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/

---

#### &#x2705; Use `here::here()`<sup>1</sup> to build file paths instead of changing the working directory with `setwd()`.

&#x1F620; Specifying absolute file paths is bad practice:

```r
library(ggplot2)
setwd("C:/Users/uli/verbose_funicular/foofy/data")
df <- read.delim("raw_foofy_data.csv")
p <- ggplot(df, aes(x, y)) + geom_point()
ggsave("../figs/foofy_scatterplot.png")
```

&xrarr; not self-contained, not portable

- file paths will not work for anyone besides the author
- high chance that objects from one project will leak into another project

.footnote[
<sup>1</sup>`here::here()` is a robust version of `file.path()`, because 
`here()` creates paths that (1) work across different operating systems and (2) always start at the top-level folder of the project.  
Example from: Jennifer Bryan. [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/#whats-wrong-with-setwd). 2017. Last accessed 10.05.2021.

]

&#x1F60A; Use file paths relative to project root folder.

```r
library(ggplot2)
*library(here)
*df <- read.delim(here("data", "raw_foofy_data.csv"))
p <- ggplot(df, aes(x, y)) + geom_point()
*ggsave(here("figs", "foofy_scatterplot.png"))
```

???

- it's convenient to put setwd() at the top of each script
- if we move our folders containing raw data or R scripts we need to change the 
file path at every occurrence of setwd() which can be tedious 
- doesn't even work for the same author when switching to a new computer or OS
- example of bad practice: not self-contained, not portable 
- file paths will not work for anyone besides the author
- chance that objects from one project will leak into the set of objects 
from another project in case a new R session is not started

---

Example content of the `R` folder of your project:

File | Description
:--- | :----------
`01_fetch-data.R` | Fetch raw data from a database with millions of rows
`02_clean-data.R` | Pre-process and tidy the data
`03_eda.R`        | Exploratory data analysis, including plotting the data
`04_model.R`      | Train a prediction model (takes some hours &#x23F1;)
`05_report.R`     | Create an R Markdown project report 
`run-analysis.R`  | Run all scripts above in order

&#x1F914; _"If we want to make a small change in one of these scripts, do we 
have to rerun all scripts from scratch again?"_

???

- saving intermediate objects as .rds becomes impracticable for 
more complex projects involving multiple data sources, preprocessing, 
modeling and multiple scripts for dissemination

> Save me from myself and having to remember all this when files change [pic.twitter.com/hVeSFQOimj](https://t.co/hVeSFQOimj)  
> Brianna McHorse, PhD, [February 21, 2018](https://twitter.com/fossilosophy/status/966408174470299648)

???

sometimes not a simple pipeline but complex graph of dependencies between inputs and outputs

---

&nbsp;

## targets: reproducible and efficient project pipelines


---

## The [targets](https://docs.ropensci.org/targets/index.html) package

> The targets package is a Make-like pipeline toolkit for Statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets skips costly runtime for tasks that are already up to date, runs the necessary computation with implicit parallel computing, and abstracts files as R objects. A fully up-to-date targets pipeline is tangible evidence that the output aligns with the code and data, which substantiates trust in the results.  
&mdash; <https://docs.ropensci.org/targets/index.html>


&nbsp;

- With targets, we can set up a **pipeline of successively executed commands**.
- The **return value** of each command is stored as a **target**.
- The targets package automatically identifies **dependencies** between **targets**. 
- The targets package ensures that only targets **affected** by a change in one of its inputs (= **outdated targets**) need to be recomputed.

---

Every targets workflow is set up in a file called `_targets.R`:

```r
library(targets)
library(tarchetypes)
options(tidyverse.quiet = TRUE)

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) + 
    geom_histogram()
}

tar_option_set(packages = c("dplyr", "forcats", "ggplot2", "readr", "tidyr"))

list(
  tar_target(raw_data, readxl::read_excel("targets_files/raw_iris.xlsx")),
  tar_target(my_data, raw_data %>% mutate(Species = fct_inorder(Species))),
  tar_target(hist, create_plot(my_data)),
  tar_target(fit, lm(Sepal.Width ~ Petal.Width + Species, my_data)),
  tar_render(report, "targets_files/report.Rmd")
)
```

???

- example data analysis pipeline (explain)
- first argument of `tar_target()` = name of the target
- second argument = command that produces the target
- input and output files can be tracked with `tar_target(..., format = "file")` so that targets knows that these files also have to be checked for changes 
- `tar_manifest()`: simple tibble containing the names of the targets and the associated 
commands

---

Show a data frame with information about the targets:

```r
tar_manifest() %>% select(name, command)
```

```
## # A tibble: 5 x 2
##   name     command                                                                        
##   <chr>    <chr>                                                                          
## 1 raw_data "readxl::read_excel(\"targets_files/raw_iris.xlsx\")"                          
## 2 my_data  "raw_data %>% mutate(Species = fct_inorder(Species))"                          
## 3 fit      "lm(Sepal.Width ~ Petal.Width + Species, my_data)"                             
## 4 hist     "create_plot(my_data)"                                                         
## 5 report   "tarchetypes::tar_render_run(path = \"targets_files/report.Rmd\",  \\n     arg~
```

---

Show the dependencies between the targets:

```r
tar_visnetwork()
```

_Interactive dependency graph (visNetwork widget): `raw_data` &rarr; `my_data` &rarr; `fit` and `hist` &rarr; `report`; the function `create_plot` feeds into `hist`. All nodes are shown as outdated._

???

- we can visually check whether we have correctly set up our targets workflow
- dependency graph powered by visNetwork, a package for network visualization, using the vis.js javascript library (http://visjs.org)
- interactive: click, drag, zoom
- color and shape legend on the right
- outdated = have not been created yet
- AUTOMATICALLY identifies dependencies between targets
- example: "report" knits an Rmd, contains references to fit and hist

- outdated target: light blue
- up-to-date target: dark green
- function: triangle
- target: circle

---

#### report.Rmd

???

- tar_read(): return built target from cache
- tar_load(): loads built target from cache into environment

---

Now we can run the pipeline defined in `_targets.R`. 
`tar_make()` builds the targets in the correct order and stores the return values in `_targets/objects/`.

```r
tar_make()
```

```
## * start target raw_data
## * built target raw_data
## * start target my_data
## * built target my_data
## * start target fit
## * built target fit
## * start target hist
## * built target hist
## * start target report
## * built target report
## * end pipeline
```

---

```r
tar_visnetwork()
```

_Interactive dependency graph (visNetwork widget): `raw_data` &rarr; `my_data` &rarr; `fit` and `hist` &rarr; `report`; the function `create_plot` feeds into `hist`. All nodes are shown as up to date._

???

- light-blue circles have become dark green (outdated -> up to date)

---

Return a built target with `tar_read()`:

```r
tar_read(hist)
```

_What happens if we would like to make some changes to the histogram?_
The number of bins should be set to 15 and a different theme should be applied.

We want the histogram to look like this:

???

skipping any work that is already up to date

---

We change the plotting function accordingly:

```r
library(ggthemes)
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) + 
*   geom_histogram(bins = 15) +
*   theme_fivethirtyeight(base_size = 18)
}
```

---

`_targets.R` now looks like this:

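```r
library(targets)
library(tarchetypes)
options(tidyverse.quiet = TRUE)

create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) + 
    geom_histogram(bins = 15) +
    theme_fivethirtyeight(base_size = 18)
}

tar_option_set(packages = c("dplyr", "forcats", "ggplot2", "ggthemes", "readr", "tidyr"))

# The list of targets is unchanged from the first version of the file
list(
  tar_target(raw_data, readxl::read_excel("targets_files/raw_iris.xlsx")),
  tar_target(my_data, raw_data %>% mutate(Species = fct_inorder(Species))),
  tar_target(hist, create_plot(my_data)),
  tar_target(fit, lm(Sepal.Width ~ Petal.Width + Species, my_data)),
  tar_render(report, "targets_files/report.Rmd")
)
```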

---

```r
tar_visnetwork()
```

_Interactive dependency graph (visNetwork widget): `raw_data` &rarr; `my_data` &rarr; `fit` and `hist` &rarr; `report`; the function `create_plot` feeds into `hist`. The nodes `create_plot`, `hist` and `report` are shown as outdated; `raw_data`, `my_data` and `fit` are up to date._

Note that the targets `hist` and `report` (and only these targets) have 
become outdated because they depend on the modified function `create_plot()`.

???

- the target hist depends on the create_plot() function and the target report 
in turn depends on hist
- these targets have become outdated
- since targets automatically created this graph of dependencies, 
it knows that the other three targets raw_data, my_data and fit are not affected by 
the change in create_plot() and thus their code does not need to be rerun; 
targets takes the values from the cache

---

We rerun the pipeline. Only `hist` and `report` are recomputed. The values of the 
other targets are pulled from the cache.

```r
tar_make()
```

```
## v skip target raw_data
## v skip target my_data
## v skip target fit
## * start target hist
## * built target hist
## * start target report
## * built target report
## * end pipeline
```
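
The invalidation behaviour can be tried out in a throwaway directory. The following is a sketch, not part of the example project: the target names `raw`, `shifted`, `total` and the function `shift_values()` are made up, and `tar_dir()`, `tar_script()` and `tar_outdated()` are targets functions that the slides do not otherwise use. The outer `if` skips the sketch when targets is not installed.

```r
if (requireNamespace("targets", quietly = TRUE)) {
  library(targets)

  tar_dir({  # evaluate the code below inside a temporary directory
    # First version of the pipeline script (_targets.R)
    tar_script({
      shift_values <- function(x) x + 10
      list(
        tar_target(raw, 1:3),
        tar_target(shifted, shift_values(raw)),
        tar_target(total, sum(shifted))
      )
    }, ask = FALSE)

    tar_make()                         # builds raw, shifted and total
    outdated_before <- tar_outdated()  # names of outdated targets after the run

    # Second version: only the body of shift_values() changes
    tar_script({
      shift_values <- function(x) x + 100
      list(
        tar_target(raw, 1:3),
        tar_target(shifted, shift_values(raw)),
        tar_target(total, sum(shifted))
      )
    }, ask = FALSE)

    outdated_after <- tar_outdated()   # targets affected by the change
  })

  print(outdated_before)
  print(outdated_after)
}
```

After the change, `shifted` and its downstream target `total` should be reported as outdated, while `raw` stays up to date because it does not depend on `shift_values()`.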

---

In the dependency graph, all targets are up-to-date again:

```r
tar_visnetwork()
```

_Interactive dependency graph (visNetwork widget): `raw_data` &rarr; `my_data` &rarr; `fit` and `hist` &rarr; `report`; the function `create_plot` feeds into `hist`. All nodes are shown as up to date._

---

&nbsp;

## Recommendations on naming<br />files & folders


---

.footnote[[PDF version](http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf)]

???

- avoid special characters such as German Umlaute, space character and capitalization
- stick to lower-case letters, numbers, hyphen and underscore characters
- 3 principles
- 1: machine readable: structure names in a way that lets you find them again using 
regular expressions
- elements of a semantic unit are connected with hyphens, e.g. first semantic unit 
"date": year, month and day are connected by hyphens
- different semantic units are separated by underscores, e.g. date, 
assay, plasmid type, well number
- use str_split_fixed to recover meta-data from file names
- 2: human readable: instead of 01.R, 02.R etc., put some information in the file name 
on what the script does -> 01_read-data.R, 02_filter-data.R, etc.
- again semantic units: prefix (order of execution), then what the file is about
- plays well with default ordering: chronological (data that have time-stamps) 
or logical (order of execution in a project workflow)
- longtime payoff for little short-term pain
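
A sketch of how metadata can be recovered from such file names. The file names are hypothetical, and base R's `strsplit()` is used here instead of `stringr::str_split_fixed()` so that no additional package is needed.

```r
# Hypothetical file names: <date>_<assay>_<plasmid>_<well>.csv
files <- c(
  "2021-05-10_assay-a_plasmid-x_A01.csv",
  "2021-05-10_assay-a_plasmid-y_A02.csv",
  "2021-05-11_assay-b_plasmid-x_B01.csv"
)

# Drop the file extension, then split the names at the underscores
stems <- tools::file_path_sans_ext(files)
parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# One column per semantic unit
meta <- data.frame(
  date    = as.Date(parts[, 1]),
  assay   = parts[, 2],
  plasmid = parts[, 3],
  well    = parts[, 4]
)
meta

# Regular expressions select files by any unit, e.g. all files of assay-a
grep("_assay-a_", files, value = TRUE)
```

Because hyphens only occur inside a unit and underscores only between units, splitting at the underscore is unambiguous.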

---

&nbsp;

## Package library management with renv


---

## Package library management

- R packages are stored in one or more directories on your computer (`.libPaths()`)
  + **user library**: packages that _you_ install are (usually) contained in `.libPaths()[1]`
  + **system library**: default packages are (usually) contained in `.libPaths()[2]`

- RStudio Packages tab provides a convenient interface for installing and 
updating packages
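
The libraries can also be inspected from the console. A small sketch using only functions from base R and utils; `stats` is picked as the example package because it ships with every R installation.

```r
# All library directories that R searches, in order
.libPaths()

# Directory from which an installed package is loaded
find.package("stats")

# Installed version of a package
packageVersion("stats")

# Number of installed packages per library directory
pkgs <- installed.packages()
table(pkgs[, "LibPath"])
```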

???

- why CRAN: most effective way to make a package available for public use
- Linux: automatically have tools installed to install "bundled" version of a package (.tar.gz)

- Use `install.packages("<pkgname>")` to install or update packages via the console
  + installs the **_binary_** version from CRAN on Windows and macOS (no additional tools required)
  + binary: single, OS-specific file (.zip for Windows, .tgz for macOS)

Sometimes, you need the latest development version hosted on GitHub, Bitbucket, etc.

- Use `remotes::install_github("<username>/<pkgname>")` to install packages from GitHub via the console 
  + installs the **_source_** version 
  + source: OS-independent directory with a specific structure
  + example: `remotes::install_github("rstudio/gt")`
  + additional tools for compilation may be necessary, e.g. Rtools for Windows
  
---

## Package library management

- many packages are continuously under development even after CRAN release
- package maintainers generally pay attention to backwards compatibility, but there is no guarantee that an update will not cause existing code to fail

Scenarios where we need another way of package management:

- &#x2757; _We want to ship the final project code to our customers. The code does not only have to work now, but also 3 months, 6 months or 1 year from now._
- &#x2757; _We do not want to keep track of code-breaking package updates._
- &#x2757; _For other projects, we must be able to update packages, in order to make use of new useful features._

???

- we have a finished, stable project
- we don't want to make changes to your code every time there is a code-breaking  change in one of the packages
- we still want to be able to update our packages for other projects so that we can work with the latest features of the packages
- we want your project code to be reproducible, portable and isolated from other projects

For example, tools like renv (and its predecessor packrat) automate the process of managing project-specific libraries. This can be important for making data products reproducible, portable, and isolated from one another. A package developer might prepend the library search path with a temporary library, containing a set of packages at specific versions, in order to explore issues with backwards and forwards compatibility, without affecting other day-to-day work

---

## Package library management with renv

`renv`: **dependency manager** that helps to set up and maintain **project-local R libraries**

- **isolated**: each project has its own library of packages
  + &xrarr; updating packages globally or in another project will not affect the current project
- **portable**: version numbers of all "active" packages are tracked in a "lock file" 
  + &xrarr; facilitates collaboration
- **reproducible**: enables saving a "snapshot", i.e., a state of the project library that you know is working 
  + &xrarr; safety net in case a package update causes problems 
- **disk-space-efficient**: packages are installed into a global cache; different projects that use the same version of a package will pull a "shared" package installation from the global cache 
  + &xrarr; also reduces installation time when a package has already been installed by another project that is managed with renv

.footnote[&#x1F4DA; Kevin Ushey. Introduction to renv. <https://rstudio.github.io/renv/articles/renv.html>. Last accessed 10.05.2021.  
&#x1F4DA; RStudio. Upgrading Packages: How to Safely Upgrade Packages. <https://environments.rstudio.com/upgrades.html>. Last accessed 10.05.2021.
]

???

- renv-test
- renv.lock
- .libPaths()

renv vocabulary

- Tools -> Project Options -> Environments -> Use renv with this project

- renv::init(): initialize new project-local package library
- renv::snapshot(): save the current state of the project package library to 
a "lock file" (`renv.lock`)
- renv::restore(): revert to previously saved state

---

&#x1F4FA; Kevin Ushey. [renv: Project Environments for R. Keynote at RSTUDIO::CONF 2020.](https://rstudio.com/resources/rstudioconf-2020/renv-project-environments-for-r/) Last accessed 10.05.2021.

---

## Activate project-local package library in RStudio

Create a new RStudio project with renv

]

Use renv in your existing RStudio project

]

---

## Example of renv.lock file
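
A lock file is a JSON document. The following is an abridged, hand-written illustration of its layout, not the lock file of this project: the repository URL and the single package record are assumptions chosen for the example, and real lock files contain one record per package along with additional fields.

```json
{
  "R": {
    "Version": "4.0.5",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.0.5",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```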

???

- JSON file
- version of R used in that project
- list of R repositories
- package records: name of R package, version number, installation source

---

&nbsp;

## Debugging

]

]

---

## Error handling

How to find the source of an error in your R code?

Recommended strategy:

1. **Restart R.** &rarr; Make sure the error _always_ occurs.

1. **Google the error message.** &rarr; It is likely that there is already an existing solution on Stack Overflow, RStudio Community, Twitter or other fora.

1. **Debug.**

1. **Ask for help.** &rarr; Prepare a [reproducible example](https://reprex.tidyverse.org/) and ask the internet.

???

1. Restarting R rules out errors that are caused by hidden dependencies.
1. It is very likely that somebody else ran into the same error before you, posted it, and that there is a working solution by now.
1. short intro
1. Use internet communities (Twitter, SO, RStudio Community, ...). 
This includes creating a minimal reproducible example (code snippet):
  - reproducible: self-contained, everything needed to reproduce the error message is included
  - minimal: code that is irrelevant for the error message is removed, which helps to isolate the cause of the error
  
- automated testing

---

## Debugging

```r
f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
  if (!is.numeric(d)) {
    stop("`d` must be numeric", call. = FALSE)
  }
  d + 10
}
```

&#x1F914; What is the result of this code when we run `f("10")`?

Useful resources:

&#x1F4DA; Example from: Hadley Wickham. Advanced R. [Chapter Debugging](https://adv-r.hadley.nz/debugging.html). Second Edition. Chapman & Hall/CRC, 2019.  
&#x1F4DA; Jonathan McPherson. [Debugging with RStudio](https://support.rstudio.com/hc/en-us/articles/200713843). Last accessed 10.05.2021.  
&#x1F4FA; Jenny Bryan. [Object of type ‘closure’ is not subsettable](https://rstudio.com/resources/rstudioconf-2020/object-of-type-closure-is-not-subsettable/). Keynote at RSTUDIO::CONF 2020. Last accessed 10.05.2021.  
&#x1F4DA; Garrett Grolemund. Hands-On Programming with R. [Appendix E: Debugging R Code](https://rstudio-education.github.io/hopr/debug.html). O'Reilly, 2014.

]

```r
f("10")
```

```
## Error: `d` must be numeric
```
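
Because the error is raised with `call. = FALSE`, the message does not say which function failed. In an interactive session, calling `traceback()` right after the error prints the call stack. The sketch below captures the same information programmatically with `withCallingHandlers()` and `sys.calls()`; the variable names `stack`, `msg` and `calls` are chosen for this illustration only.

```r
f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
  if (!is.numeric(d)) {
    stop("`d` must be numeric", call. = FALSE)
  }
  d + 10
}

stack <- NULL
msg <- tryCatch(
  withCallingHandlers(
    f("10"),
    # Runs while the failing frames are still on the stack
    error = function(e) stack <<- sys.calls()
  ),
  # Runs after unwinding: return the message instead of aborting
  error = function(e) conditionMessage(e)
)

msg

# Keep the first deparsed line of every call and pick out our own functions
calls <- vapply(as.list(stack), function(cl) deparse(cl)[1], character(1))
calls[calls %in% c('f("10")', "g(a)", "h(b)", "i(c)")]
```

Read from first to last, the selected calls show the path `f()` &rarr; `g()` &rarr; `h()` &rarr; `i()` that led to the error.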

---

## Interactive debug mode

???

- show **traceback** aka the call stack: sequence of functions called that led to the error
- rerun with Debug: set a break point immediately before the error is thrown
  - code execution is paused
  - lets you inspect the environment of a function

1. set break point (click on the gray space to the left of the line number or press **⇧Shift**+**F9**)

???

- stop executing code exactly at the line of the break point

2. console debugging commands (`n,s,f,c,Q`)

???

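- n: execute the next statement, stepping over function calls
- s: execute the next statement, stepping into function calls
- f: finish execution of the current loop/function
- c: leave the debugger & continue code execution until the next break point
- Q: quit the debugger & abort the current evaluation (return to the top-level prompt)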

3. `browser()`

???

- similar to first approach, but source code is changed

4. `debug()`, `debugonce()`

???

- just for completeness: adds a break point at the beginning of a function
- debugonce: one-shot break point, i.e., the function will enter the debugger the very next time it runs, but not after that

5. `undebug()`

6. RStudio: _Debug_ &xrarr; _On Error_ &xrarr; _Break in Code_

???

- enters interactive debugging at the line where the error occurs

7. `debugSource()`

???

- advantage of browser(): recursive debugging of scripts

8. `options(error = recover)`

???

- interactive prompt that displays the traceback

9. `options(warn = 2)`

???

- convert warning to error
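
A minimal sketch of this conversion, using base R only. The variable names `old_opts` and `res` are chosen for the illustration; `as.numeric("abc")` normally returns `NA` together with a coercion warning.

```r
# Temporarily turn warnings into errors; options() returns the old settings
old_opts <- options(warn = 2)

res <- tryCatch(
  as.numeric("abc"),        # normally: NA plus a coercion warning
  error = function(e) e     # with warn = 2 the warning arrives as an error
)

# Restore the previous warning level
options(old_opts)

inherits(res, "error")
```

Once a warning is an error, the debugging tools listed above (traceback, `options(error = recover)`, break on error) apply to it as well.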

???

Tools -> Global Options -> General -> Advanced -> uncheck "Use debug handler only when my code contains errors"
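
The function-level break points from items 4 and 5 can also be tried outside of RStudio. The sketch below only sets and clears the debug flag and checks it with `isdebugged()`; it does not call the function while the flag is set, so no debugger prompt opens. The names `flag_on` and `flag_off` are chosen for the illustration.

```r
i <- function(d) {
  if (!is.numeric(d)) {
    stop("`d` must be numeric", call. = FALSE)
  }
  d + 10
}

debug(i)                    # every call of i() would now start in the debugger
flag_on <- isdebugged(i)    # TRUE

undebug(i)                  # remove the flag again
flag_off <- isdebugged(i)   # FALSE

i(1)                        # runs normally and returns 11
```

`debugonce(i)` works like `debug(i)`, except that the flag is cleared automatically after the next call.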

---

## Session info

```
##  setting  value                       
##  version  R version 4.0.5 (2021-03-31)
##  os       Windows 10 x64              
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  ctype    English_United States.1252  
##  tz       Europe/Berlin               
##  date     2021-05-10
```

]

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> package </th>
   <th style="text-align:left;"> version </th>
   <th style="text-align:left;"> date </th>
   <th style="text-align:left;"> source </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> dplyr </td>
   <td style="text-align:left;"> 1.0.5 </td>
   <td style="text-align:left;"> 2021-03-05 </td>
   <td style="text-align:left;"> CRAN (R 4.0.4) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> forcats </td>
   <td style="text-align:left;"> 0.5.1 </td>
   <td style="text-align:left;"> 2021-01-27 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ggplot2 </td>
   <td style="text-align:left;"> 3.3.3 </td>
   <td style="text-align:left;"> 2020-12-30 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ggthemes </td>
   <td style="text-align:left;"> 4.2.4 </td>
   <td style="text-align:left;"> 2021-01-20 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> kableExtra </td>
   <td style="text-align:left;"> 1.3.4 </td>
   <td style="text-align:left;"> 2021-02-20 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> knitr </td>
   <td style="text-align:left;"> 1.31 </td>
   <td style="text-align:left;"> 2021-01-27 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> purrr </td>
   <td style="text-align:left;"> 0.3.4 </td>
   <td style="text-align:left;"> 2020-04-17 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
</tbody>
</table>

]

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> package </th>
   <th style="text-align:left;"> version </th>
   <th style="text-align:left;"> date </th>
   <th style="text-align:left;"> source </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> readr </td>
   <td style="text-align:left;"> 1.4.0 </td>
   <td style="text-align:left;"> 2020-10-05 </td>
   <td style="text-align:left;"> CRAN (R 4.0.3) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> stringr </td>
   <td style="text-align:left;"> 1.4.0 </td>
   <td style="text-align:left;"> 2019-02-10 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tarchetypes </td>
   <td style="text-align:left;"> 0.2.0.9000 </td>
   <td style="text-align:left;"> 2021-05-10 </td>
   <td style="text-align:left;"> Github (ropensci/tarchetypes) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> targets </td>
   <td style="text-align:left;"> 0.4.2 </td>
   <td style="text-align:left;"> 2021-04-30 </td>
   <td style="text-align:left;"> CRAN (R 4.0.5) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tibble </td>
   <td style="text-align:left;"> 3.1.1 </td>
   <td style="text-align:left;"> 2021-04-18 </td>
   <td style="text-align:left;"> CRAN (R 4.0.5) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tidyr </td>
   <td style="text-align:left;"> 1.1.3 </td>
   <td style="text-align:left;"> 2021-03-03 </td>
   <td style="text-align:left;"> CRAN (R 4.0.4) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> tidyverse </td>
   <td style="text-align:left;"> 1.3.0 </td>
   <td style="text-align:left;"> 2019-11-21 </td>
   <td style="text-align:left;"> CRAN (R 4.0.2) </td>
  </tr>
</tbody>
</table>

]

</div>

---

# Thank you! Questions?

&nbsp;