class: title-slide, center, bottom # 06 - Linear models ## Data Science with R · Summer 2021 ### Uli Niemann · Knowledge Management & Discovery Lab #### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/) .courtesy[📷 Photo courtesy of Ulrich Arendt] --- ## Bivariate relationships Example data: **father-son dataset** (1,078 pairs of father and son height measurements.) .pull-left70[ ```r library(tidyverse) father_son <- UsingR::father.son %>% as_tibble() %>% mutate(across(everything(), ~ . * 2.54)) # convert from inches to cm father_son ``` ``` ## # A tibble: 1,078 x 2 ## fheight sheight ## <dbl> <dbl> ## 1 165. 152. ## 2 161. 161. ## 3 165. 161. ## 4 167. 159. ## 5 155. 163. ## 6 160. 163. ## 7 166. 163. ## 8 164. 163. ## 9 168. 164. ## 10 170. 163. ## # ... with 1,068 more rows ``` ] .pull-right30[ <img src="figures/06-heights.png" width="100%" /> ] .content-box-yellow[ _How can we summarize the relationship between the two variables `fheight` and `sheight`?_ ] ??? - series of three meetings in which we discuss regression and classification concepts and algorithms - today: the mother of all regression methods, linear regression - linear regression: a very popular method in many domains, e.g., medical research - has been around for a long time and is relatively simple, but: - many modern methods build upon linear regression -> generalizations or extensions of linear regression - understanding how to use linear regression is a good starting point for being able to use more advanced techniques - LR can be used to assess the relationship between two or more variables - here, we have ca. 1,000 height measurements of fathers and their sons --- ## Summary statistics We could describe the relationship between `fheight` and `sheight` with **mean** and **standard deviation**: ```r father_son %>% summarize( avg_height_father = mean(fheight), sd_height_father = sd(fheight), avg_height_son = mean(sheight), sd_height_son = sd(sheight) ) ``` ``` ## # A tibble: 1 x 4 ## avg_height_father sd_height_father avg_height_son sd_height_son ## <dbl> <dbl> <dbl> <dbl> ## 1 172. 6.97 174. 7.15 ``` .content-box-yellow[ Are mean and standard deviation sufficient? ] --- ## Visualizing bivariate relationships with... .panelset[ .panel[.panel-name[...scatterplots] ```r ggplot(father_son, aes(x = fheight, y = sheight)) + geom_point() + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` <img src="figures/_gen/06/father-son-scatter-1.png" width="510.236220472441" /> ] .panel[.panel-name[...boxplots] .pull-left60[ `ggplot2::cut_interval()`: convert a numerical vector into a factor by discretizing the values into `n` groups of equal range ```r father_son %>% mutate(fheight = cut_interval(fheight, n = 5)) %>% ggplot(aes(x = fheight, y = sheight)) + geom_boxplot() + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` ] .pull-right40[ <img src="figures/_gen/06/bla-1.png" width="510.236220472441" /> ] .content-box-green[ We observe that generally taller fathers have taller sons. ] ] ] ???
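- side note: a minimal sketch, assuming the `father_son` tibble from the previous slide, of what `cut_interval()` does on its own (the column name `fheight_bin` is made up for this illustration):

```r
# Discretize fheight into 5 equal-width bins and count the fathers per bin.
father_son %>%
  mutate(fheight_bin = cut_interval(fheight, n = 5)) %>%
  count(fheight_bin)
```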
- conveniently, we only have 2 variables here, so we can visualize them easily --- ## Characterizing bivariate relationships .panelset[ .panel[.panel-name[Linearity] <img src="figures/06-bivariate-rel-linearity.png" width="90%" /> ] .panel[.panel-name[Direction] <img src="figures/06-bivariate-rel-direction.png" width="60%" /> ] .panel[.panel-name[Strength] <img src="figures/06-bivariate-rel-strength.png" width="60%" /> ] .panel[.panel-name[Outliers] <img src="figures/06-bivariate-rel-outliers.png" width="60%" /> ] ] --- ## Examples of bivariate relationships .panelset[ .panel[.panel-name[`bdims`] ```r library(openintro) ggplot(bdims, aes(hgt, wgt)) + geom_point() + labs(x = "Height [cm]", y = "Weight [kg]", title = "Body measurements of 507 people", caption = "Source: bdims data | R package openintro") ``` <img src="figures/_gen/06/bdims-scatter-1.png" width="510.236220472441" /> ] .panel[.panel-name[`mammals`] ```r ggplot(mammals, aes(x = body_wt, y = brain_wt)) + geom_point() + labs(x = "Body weight [kg]", y = "Brain weight [kg]", title = "Measurements of 62 species of mammals", caption = "Source: mammals dataset | R package openintro") ``` <img src="figures/_gen/06/mammals-scatter-1.png" width="510.236220472441" /> ] .panel[.panel-name[`mammals` (without outliers)] ```r mammals %>% * filter(body_wt < 100, brain_wt < 100) %>% ggplot(aes(x = body_wt, y = brain_wt)) + geom_point() + labs(x = "Body weight [kg]", y = "Brain weight [kg]", title = "Measurements of 39 mammals with\nbody_wt < 100 & brain_wt < 100", caption = "Source: mammals dataset | R package openintro") ``` <img src="figures/_gen/06/smaller-mammals-scatter-1.png" width="510.236220472441" /> ] .panel[.panel-name[`Boston`] ```r ggplot(MASS::Boston, aes(x = lstat, y = medv)) + geom_point() + # geom_smooth(formula = y ~ x + I(x^2)) + labs(x = "Share of the lower status of the population [%]", y = "Median home value [1,000US$]", title = "Housing values in suburbs of Boston, MA", caption = "Source: Boston dataset | R package MASS") ``` <img src="figures/_gen/06/boston-scatter-1.png" width="510.236220472441" /> ] .panel[.panel-name[`smoking`] ```r ggplot(smoking, aes(x = age, y = amt_weekdays)) + geom_point() + labs(x = "Age", y = "Cigarettes smoked per weekday", title = "Frequency of smoking by age (UK; 2000-2009)", caption = "Source: smoking dataset | R package openintro") ``` ``` ## Warning: Removed 1270 rows containing missing values (geom_point). ``` <img src="figures/_gen/06/smoking-scatter-1.png" width="510.236220472441" /> ] ] ??? bdims: - form: linear - direction: positive - strength: moderate to high - outliers: _few_ mammals: - form: ?? - direction: positive - strength: strong - outliers: few strong outliers smaller mammals: - form: _rather_ linear - direction: positive - strength: moderate-strong - outliers: few strong outliers Boston: magnitude of the negative relationship is not constant for every value of lstat -> stronger decrease in house prices with increasing share of the lower-status population in the 0-10% range of lstat - form: non-linear - direction: negative - strength: moderate - outliers: few smoking: - form: ?? - direction: ?? - strength: no - very low - outliers: ?? --- ## Correlation **(Pearson) Correlation coefficient** `\(\rho\)`: quantification of the **strength** and **direction** of a **linear** **bivariate** relationship.
`$$\rho = \frac{\text{cov}_{x,y}}{s_x s_y} = \frac{\frac{\sum_{i=1}^N (x_i-\bar{x})(y_i-\bar{y})}{N-1}}{\sqrt{\frac{\sum_{i=1}^{N} \left(x_i-\bar{x}\right)^2}{N-1}} \sqrt{\frac{\sum_{i=1}^{N} \left(y_i-\bar{y}\right)^2}{N-1}}}$$` with - `\(\bar{x},\bar{y}\)`: sample mean of `\(x\)`, `\(y\)` - `\(s_x, s_y\)`: standard deviation of `\(x\)`, `\(y\)` - `\(\text{cov}_{x,y}\)`: covariance of `\(x\)` and `\(y\)` - `\(N\)`: number of observations ??? - a crude measure of the relationship between variables is the covariance - if we standardize this value we get Pearson's correlation coefficient - numerator: the covariance, i.e., the sum of products of the deviations from the mean for x and y, divided by N-1 - denominator: product of the standard deviations of x and y --- ## Interpretation Value | Direction | Description :-------------------- | :------------------------------- | :---------------------------- `\(\rho>0\)` | **positive linear relationship** | _The variables tend to be above or below their means together._ `\(\rho<0\)` | **negative linear relationship** | _The variables tend to move in opposite directions._ Value | Strength (rules of thumb) :---------------------------- | :------------------------------------- `\(\rho\approx 0\)` | **no linear correlation** (this alone does not imply that the variables are independent) `\(\text{abs}(\rho)\approx 0.1\)` | **weak relationship** `\(\text{abs}(\rho)\approx 0.3\)` | **moderate relationship** `\(\text{abs}(\rho)\approx 0.5\)` | **strong relationship** `\(\text{abs}(\rho)\approx 1\)` | **perfect linear correlation** (one variable is a linear function of the other) ??? - correlation coefficient has to lie between -1 and 1 - sign: direction - magnitude: strength - rules of thumb regarding the strength of the relationship: +-0.1: low strength, +-0.3 moderate strength, +-0.5 high strength --- ## Correlation examples Six examples of bivariate distributions with different correlation coefficients: .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/cor-examples-1.png" width="850.393700787402" /> ] .panel[.panel-name[Code] ```r library(purrr) n <- 250 # Sample size cors <- c(-0.9, -0.5, 0, 0.5, 0.9, 0.99) # correlation coefficients dat <- map_dfr(cors, function(r) { # Draw random sample from bivariate normal distribution with # mean=0, sd=1 and covar=r mat_mvrnorm <- MASS::mvrnorm(n, c(0,0), matrix(c(1,r,r,1),2,2)) tibble(r = r, x = mat_mvrnorm[, 1], y = mat_mvrnorm[, 2]) }) ggplot(dat, aes(x,y)) + facet_wrap(~r, labeller = as_labeller(function(x){ latex2exp::TeX(paste0("$\\rho$ = ", x), output = "text") }, default = label_parsed)) + geom_point() + geom_vline(xintercept = 0, linetype = 2) + geom_hline(yintercept = 0, linetype = 2) ``` ] ] ???
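- side note: a minimal verification sketch, assuming the `father_son` data from the earlier slides, that recomputes rho directly from the definition two slides back and compares it with `stats::cor()`:

```r
# Sample covariance divided by the product of the standard deviations.
x <- father_son$fheight
y <- father_son$sheight
(sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)) / (sd(x) * sd(y))
cor(x, y)  # should agree up to floating-point error (~0.50)
```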
- 1, 5, 6: strong linear relationship - 1: negative; 5, 6: positive - 2, 4: moderate linear relationship - 3: uncorrelated variables --- ## Computing correlation coefficients with `R` We can calculate the correlation coefficient in `R` using `stats::cor()`: ```r cor(x, y = NULL, use = "everything", method = "pearson") ``` - `x`: numeric variable/ data frame/ matrix - `y`: another numeric variable (optional) - `use`: handling of missing values - `"everything"` returns `NA` if any of the two variables has missing values - `"all.obs"` returns an error if there are any missing values in the data - `"complete.obs"` computes correlations only from cases without missing values - `"pairwise.complete.obs"` computes correlations between pairs of variables only from cases without missing values for those two variables - `method`: - `"pearson"`: parametric, use for normally distributed data - `"spearman"`: non-parametric, use for non-normally distributed data - `"kendall"`: non-parametric, use for _smallish_ non-normally distributed data ??? - kendall's tau: also when you have a small dataset with a large number of tied ranks --- ## Correlation: father-son data .pull-left[ Pearson correlation coefficient for father-son data: ```r rho <- cor(father_son$fheight, father_son$sheight) rho ``` ``` ## [1] 0.5013383 ``` ] .pull-right[ <img src="figures/_gen/06/father-son-scatter-cor-1.png" width="510.236220472441" /> ] ??? - by means of rho we can be more precise regarding the strength and direction of the relationship between fathers and sons - 0.5: moderate linear correlation - generally, taller fathers have taller sons and shorter fathers have shorter sons, BUT - there are a few cases where this statement doesn't hold true - some tall fathers have short sons and some short fathers have tall sons --- ## Use correlation for linear relationships only .content-box-yellow[ ⚠️ Correlation is not always an appropriate summary of the relationship between a pair of variables.
] This is illustrated by the famous [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) which comprises four two-dimensional datasets with a correlation of `\(\rho=0.82\)` between the input features: .pull-left[ <table class="table" style=""> <thead> <tr> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> x4 </th> <th style="text-align:right;"> y1 </th> <th style="text-align:right;"> y2 </th> <th style="text-align:right;"> y3 </th> <th style="text-align:right;"> y4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.04 </td> <td style="text-align:right;"> 9.14 </td> <td style="text-align:right;"> 7.46 </td> <td style="text-align:right;"> 6.58 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> <td style="text-align:right;"> 8.14 </td> <td style="text-align:right;"> 6.77 </td> <td style="text-align:right;"> 5.76 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.58 </td> <td style="text-align:right;"> 8.74 </td> <td style="text-align:right;"> 12.74 </td> <td style="text-align:right;"> 7.71 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.81 </td> <td style="text-align:right;"> 8.77 </td> <td style="text-align:right;"> 7.11 </td> <td style="text-align:right;"> 8.84 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.33 </td> <td style="text-align:right;"> 9.26 </td> <td style="text-align:right;"> 7.81 </td> <td style="text-align:right;"> 8.47 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9.96 </td> <td style="text-align:right;"> 8.10 </td> <td style="text-align:right;"> 8.84 </td> <td style="text-align:right;"> 7.04 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.24 </td> <td style="text-align:right;"> 6.13 </td> <td style="text-align:right;"> 6.08 </td> <td style="text-align:right;"> 5.25 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 4.26 </td> <td style="text-align:right;"> 3.10 </td> <td style="text-align:right;"> 5.39 </td> <td style="text-align:right;"> 12.50 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 10.84 </td> <td style="text-align:right;"> 
9.13 </td> <td style="text-align:right;"> 8.15 </td> <td style="text-align:right;"> 5.56 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4.82 </td> <td style="text-align:right;"> 7.26 </td> <td style="text-align:right;"> 6.42 </td> <td style="text-align:right;"> 7.91 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.68 </td> <td style="text-align:right;"> 4.74 </td> <td style="text-align:right;"> 5.73 </td> <td style="text-align:right;"> 6.89 </td> </tr> </tbody> </table> ] .pull-right[ .font80[ ``` ## # A tibble: 4 x 6 ## group x_mean x_sd y_mean y_sd rho ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 9 3.32 7.50 2.03 0.816 ## 2 2 9 3.32 7.50 2.03 0.816 ## 3 3 9 3.32 7.5 2.03 0.816 ## 4 4 9 3.32 7.50 2.03 0.817 ``` ] <img src="figures/_gen/06/anscombe-plot-1.png" width="510.236220472441" /> ] ??? - 11 observations each - one value each for x and y - identical values for descriptive measures such as mean and standard deviation - if we plot the data, we notice that the relationship between x and y is very different across the four datasets - 1: relatively high linear relationship -> rho is appropriate here - 2: non-linear relationship - 3: if we remove the outlier, we would have a perfect correlation with rho of 1 - 4: inverse to 3: basically, there is no relationship between variables, but the single outlier is responsible for the high correlation coefficient - 3: outlier leads to a reduction of the correlation coefficient - 4: whereas in 4, the outlier leads to an increase of the correlation coefficient --- ## Spurious correlations <img src="figures/06-spurious_correlation.PNG" width="100%" style="display: block; margin: auto;" /> .footnote[Figure source: http://www.tylervigen.com/spurious-correlations. Last accessed 17.02.2021.] ??? - But be careful when interpreting rho - line chart shows an almost perfect correlation - black line: number of suicides by hanging, strangulation and suffocation from 1999 to 2009 - red line: US spending on science, space and technology - spurious correlation: statistically correlated, but one does not cause the other - be careful when interpreting the correlation coefficient: the line chart shows an almost perfect correlation between US spending on science, space and technology on one side and the number of suicides by hanging, strangulation and suffocation on the other - check cause and effect - another example: the number of people who drown in a pool and the number of Nicolas Cage films per year --- ## Correlation vs. causation <img src="figures/06-correlation.png" width="45%" /> .content-box-yellow[ ⚠️ Correlation does not imply causation. ] Possible reasons: **Confounding/ third-variable problem**: a measured or unmeasured variable affects the results. Example: _Elite college degree → adult earnings_ (confounder: parental socioeconomic status) **Direction of causality**: correlation coefficients do not imply the direction of causation.
Example: _Chocolate consumption and acne are correlated._ _Domain knowledge suggests a mechanism in one direction only (chocolate contains ingredients such as fats or sugar that can promote acne), so chocolate consumption may lead to acne, but not the other way around; the correlation coefficient alone cannot tell us this direction._ .footnote[Figure source: https://xkcd.com/552/] ??? - children in families of higher socioeconomic status are more likely to attend elite colleges and - children in families of higher socioeconomic status have, on average, higher earnings as adults, regardless of whether they attended an elite college. - a correlation can also be coincidental - application for visual analytics: a domain expert checks whether correlations make sense. Exercise: DisStaUsiR p. 242 --- ## Simple Linear Regression Goal: **predict** a **quantitative** **response** `\(y\)` on the basis of a single **predictor** `\(x\)` Terminology: - `\(y\)`: response / dependent variable - `\(x\)`: predictor / independent or explanatory variable ??? - so far, we looked at how to measure relationships between two variables. - we take this process a step further and predict one variable from another - example: based on the height of the father, try to predict the height of his son - example: try to predict levels of stress from the amount of time until you have to give a talk (negative relationship -> if there's 10 min to go until someone has to give a talk, how anxious will they be) - rho: a measure of the strength and direction of linear relationships - we are interested in learning a model from our data that describes how tall a son will be, given the height of his father. - in linear regression, we look for the straight line that best describes the relationship between 2 variables. one independent variable = simple linear regression -> one predictor variable -> in general, a response variable can be described by several predictor variables. In that case, an independence assumption is made. --- ## General statistical model The value of the **response** is **modeled** as a **function of the predictor** (plus some additional noise). `$$y = f(x) + \epsilon, \qquad \epsilon \sim N(0, \sigma_{\epsilon})$$` Example: `\(y = f(x) = 2x + 3\)` is a function with input `\(x\)` (predictor) and output `\(y\)` (response). If `\(x\)` is 4, then `\(y = 2 \times 4 + 3 = 11\)`. --- ## Simple linear regression **Simple** linear regression: predict a quantitative response `\(Y\)` on the basis of a **single predictor variable** `\(X\)` Assumption: model fit is linear, i.e., the data can be summarized with a straight line. `$$f(x) = \beta_0 + \beta_1 x$$` `\(\rightarrow\)` two unknown **parameters**/**coefficients**: **intercept** `\(\beta_0\)` and **slope** `\(\beta_1\)` ???
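- side note: a small simulation sketch of the model above, using f(x) = 2x + 3 from the previous slide and an assumed error standard deviation of 1 (all values here are made up for illustration):

```r
# Simulate y = 2x + 3 + eps and check that a fitted line recovers
# an intercept of about 3 and a slope of about 2.
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 * x + 3 + rnorm(100, mean = 0, sd = 1)
coef(lm(y ~ x))
```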
- intercept: the point where the line crosses the y-axis - slope: the gradient of the line --- ## Best fit line .panelset[ .panel[.panel-name[Line candidates] ```r # Manually create some regression lines df_reg_lines <- expand_grid(intercept = 82:88, slope = seq(0.4, 0.6, 0.1)) ggplot(data = father_son, aes(x = fheight, y = sheight)) + geom_point() + * geom_abline(data = df_reg_lines, aes(intercept = intercept, slope = slope), color = "orange") + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` <img src="figures/_gen/06/best-fit-line-1.png" width="648" /> ] .panel[.panel-name[Best fit] ```r ggplot(data = father_son, aes(x = fheight, y = sheight)) + geom_point() + geom_abline(data = df_reg_lines, aes(intercept = intercept, slope = slope), color = "orange") + * geom_smooth(method = "lm", size = 1) + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` <img src="figures/_gen/06/best-fit-line-2-1.png" width="648" /> ] .panel[.panel-name[Best fit without measure of uncertainty] ```r ggplot(data = father_son, aes(x = fheight, y = sheight)) + geom_point() + geom_abline(data = df_reg_lines, aes(intercept = intercept, slope = slope), color = "orange") + geom_smooth(method = "lm", size = 1, * se = FALSE) + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` <img src="figures/_gen/06/best-fit-line-3-1.png" width="648" /> ] ] ??? - expand_grid() creates a data frame with all possible combinations of the input vectors - geom_smooth() automatically fits a linear regression and shows its line - gray area around the line: uncertainty of the line - Why does this line fit the data the best? - We need a measure to compare different models. --- ## Residuals Fitted values: `\(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\)` ⟶ **Residuals**: `\(e = Y - \hat{Y}\)` A residual quantifies the model error for an individual observation by calculating the difference between the actual response value and the predicted response value. .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/residuals-1.png" width="648" /> ] .panel[.panel-name[Code] ```r model <- linear_reg() %>% set_engine("lm") %>% fit(sheight ~ fheight, data = father_son) father_son %>% mutate(y_hat = model$fit$coefficients[1] + fheight * model$fit$coefficients[2]) %>% mutate(res_cat = ifelse(y_hat > sheight, "overest", "underest")) %>% ggplot(aes(fheight, sheight)) + geom_segment(aes(xend = fheight, yend = y_hat, color = res_cat), show.legend = FALSE, size = .25) + geom_point() + annotate("text", x = 150, y = 195, label = "model underestimates true value\n(e > 0)", color = "coral", hjust = 0, size = 5) + annotate("text", x = 192.5, y = 150, label = "model overestimates true value\n(e < 0)", color = "cadetblue2", hjust = 1, size = 5) + scale_color_manual(values = c("cadetblue2", "coral")) + geom_smooth(method = "lm", se = FALSE, size = 1) + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` ] ] ??? - this is where the term residual for the model error comes into play - please note the hat notation - I'm using the hat notation when I refer to the regression model and its predictions - I'm using the notation without hats when I'm referring to the actual values - now the goal is to estimate the values of the parameters beta_0 and beta_1 such that the model error is minimized - in linear regression, the model error is the difference between the true value of the response and the predicted value of the response - in this plot, it is the vertical deviation between the true values shown as points and the corresponding values on the line.
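- side note: a quick sanity check, assuming the `model` object fitted in the code panel above; for a least-squares fit with an intercept, the residuals sum to (numerically) zero and the mean of the fitted values equals the mean of the response:

```r
# Both quantities should be ~0 up to floating-point error.
sum(residuals(model$fit))
mean(fitted(model$fit)) - mean(father_son$sheight)
```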
--- ## Least squares fitting procedure Goal: find the values that minimize the distance of the fitted model to the data. `\(\rightarrow\)` **Least squares** (LS) equation for `\(N\)` observations of pairs `\((x_i, y_i)\)`: `$$\min_{\beta_0, \beta_1} RSS=\min_{\beta_0, \beta_1} \sum_{i=1}^N e_i^2 = \min_{\beta_0, \beta_1}\sum_{i=1}^N \left( y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}\right)\right)^2$$` - Parameter values of the minimum are referred to as the **least squares estimates (LSE)** `\(\hat{\beta}\)` - The loss function is also known as **residual sum of squares** `\(RSS\)` ??? - for the optimal regression line, the sum of all residuals is equal to 0 AND - ... the line passes through the point of the means of x and y - the resulting parameter values are called least squares estimates --- ## Relationship between correlation and regression The regression line implies that for every standard deviation `\(\sigma_x\)` that `\(x\)` lies above the average `\(\bar{x}\)`, the predicted `\(y\)` lies `\(\rho\)` standard deviations `\(\sigma_y\)` above the average `\(\bar{y}\)`: `$$\left( \frac{y_i-\bar{y}}{\sigma_y} \right) = \rho \left( \frac{x_i-\bar{x}}{\sigma_x} \right)$$` -- Example for our father-son dataset, given a father height of `\(x_i\)` = 160 cm: $$ `\begin{aligned} \left( \frac{\hat{y_i}-174.46}{7.15} \right) &= 0.5013 \cdot \left( \frac{160-171.93}{6.97} \right) \\ \hat{y_i} &= 0.5013 \cdot \left( \frac{160-171.93}{6.97} \right)\cdot 7.15 + 174.46\\ \hat{y_i} &= 168.33 \end{aligned}` $$ --- ## Tidymodels .pull-left[ <img src="figures//06-tidymodels.png" width="100%" /> ] .pull-right[ .content-box-gray[ .font110[ "The **tidymodels** framework is a **collection of packages** for **modeling and machine learning** using **tidyverse** principles." ] – <https://www.tidymodels.org/> ] ] --- ## Linear regression in `R` .panelset[ .panel[.panel-name[Specify model] ```r library(tidymodels) linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ``` ] .panel[.panel-name[Set engine] ```r linear_reg() %>% set_engine("lm") ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` ] .panel[.panel-name[Fit model and estimate parameters] ```r linear_reg() %>% set_engine("lm") %>% fit(formula = sheight ~ fheight, data = father_son) -> model model ``` ``` ## parsnip model object ## ## Fit time: 0ms ## ## Call: ## stats::lm(formula = sheight ~ fheight, data = data) ## ## Coefficients: ## (Intercept) fheight ## 86.0720 0.5141 ``` .content-box-blue[ We specify the regression equation using **formula syntax**: `response ~ predictor1 + predictor2 + ...` .font80[ (Note that we don't use quotation marks here.)] ] ] ] ??? - `formula`: a formula with notation `outcome ~ predictor` - `~` (tilde) operator means "predicted from" - equation notation: response on the left side of the tilde and predictor on the right side of the tilde - very flexible -> if we had multiple predictors, we could include them in the equation by using the + sign - OR log transform the predictor or response -> `log(predictor)` - `data`: name of the data frame which contains the outcome and predictor .footnote[ Use `as.formula()` to convert a string to a formula object.
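For example, piping the same model specification into `fit(as.formula("sheight ~ fheight"), data = father_son)` should produce the same fit as writing the formula directly.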
] --- ## Model output ``` ## parsnip model object ## ## Fit time: 0ms ## ## Call: ## stats::lm(formula = sheight ~ fheight, data = data) ## ## Coefficients: ## (Intercept) fheight ## 86.0720 0.5141 ``` `$$\widehat{sheight_i} = 86.0720 + 0.5141 \cdot fheight_i$$` .content-box-blue[ #### Interpretation: - **Slope**: For each additional cm the father is taller, the son is expected to be taller, on average, by 0.5141 cm. - **Intercept**: A father who is 0 cm tall is expected to have a son that is, on average, 86.0720 cm tall. ] .content-box-green[ For simple linear regression, the relationship between a regression line's slope `\(\beta_1\)` and the Pearson correlation coefficient `\(\rho\)` between the explanatory variable and the response is: `\(\beta_1=\rho \cdot \frac{s_y}{s_x}\)`. ] ??? - interpretation of the coefficients: - positive slope: same sign as rho; the taller the father, the taller the son - value = 0.51: if the father were 1 cm taller, the predicted son height would be 0.51 cm larger. - If `\(x\)` and `\(y\)` are standardized, the regression line has intercept 0 and slope `\(\rho\)`. --- ## Extrapolation <img src="figures/06-xkcd_regression.png" width="50%" /> .content-box-yellow[ ⚠️ Be careful when interpreting linear model predictions for observations beyond the range of the data it was built on. ] .footnote[Figure source: <https://xkcd.com/605/>. Last accessed 17.02.2021.] ??? - comic nicely illustrates that in most cases, linear regression models are only useful for the value ranges they were learned on. --- ## Assessing model diagnostics with the `broom` package .pull-left80[ The `broom` package provides functions to convert model diagnostics into _tidy_ data frames. - `tidy()` returns for each coefficient a row with its value, the standard error, `\(t\)`-statistic and p-value. - `augment()` adds columns to the original dataset from the model fit, e.g. residuals and influence statistics. - `glance()` returns a row with summary statistics, e.g. `\(R^2\)`, degrees of freedom, AIC. ] .pull-right20[ <img src="figures/06-broom.png" width="100%" /> ] ```r library(broom) # included in tidymodels ``` .footnote[ See `vignette("broom")` for more information. ] ??? - three weeks ago -> tidyverse and its data frame centric functions - lm is a very old function and produces an object of a class dedicated for linear regression - broom provides some functions to convert the output of the lm() function to a data frame - e.g. augment, ... - fheight and sheight: original values - new variables whose names start with a dot . (diagnostic values) - .fitted: predicted values of our model - .se.fit: standard error of the fitted values - .resid: residuals - .hat: leverage/hat values - .sigma: residual standard deviation when the respective observation is removed from the model - .cooksd: Cook's distance - .std.resid: standardized residual (.resid/standard deviation of residuals) --- ## Tidy model output ```r linear_reg() %>% set_engine("lm") %>% fit(formula = sheight ~ fheight, data = father_son) %>% * tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 86.1 4.65 18.5 1.60e-66 ## 2 fheight 0.514 0.0270 19.0 1.12e-69 ``` `$$\widehat{sheight_i} = 86.0720 + 0.5141 \cdot fheight_i$$` ???
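- side note: a quick numeric check, assuming the `father_son` data, of the slope/correlation relationship beta_1 = rho * s_y / s_x stated in the green box on the model-output slide:

```r
# Should reproduce the fitted coefficient for fheight (~0.514).
cor(father_son$fheight, father_son$sheight) *
  sd(father_son$sheight) / sd(father_son$fheight)
```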
- one row for the intercept, one row for the explanatory variable - estimate: beta coefficients / model parameters - std.error: uncertainty around the estimates - statistic: computed from estimate and std.error to assess the significance of the relationship - p.value: probability (between 0 and 1) of obtaining a statistic at least this extreme if the true coefficient were 0 --- ## Assessing the model fit .panelset[ .panel[.panel-name[TSS] A simple **baseline model** is the mean of the response. ```r (y_hat <- mean(father_son$sheight)) ``` ``` ## [1] 174.4575 ``` <img src="figures/_gen/06/tss-plot-1.png" width="648" /> The sum of squared differences between the observed values and the values predicted by the mean is called **total sum of squares `\(TSS\)`**. ] .panel[.panel-name[RSS] The sum of squared differences between the observed values and the values predicted by the linear regression model is called **residual sum of squares `\(RSS\)`**. <img src="figures/_gen/06/rss-plot-1.png" width="648" /> ] .panel[.panel-name[MSS] The improvement in prediction between the regression model and the mean is shown in the **model sum of squares `\(MSS\)`**. It is the sum of squared differences between the predicted values of the regression model and the mean. The value of `\(MSS\)` is large when the regression model is very different from the mean prediction. <img src="figures/_gen/06/mss-plot-1.png" width="648" /> ] .panel[.panel-name[R²] The proportion of improvement due to the regression model is **the coefficient of determination `\(R^2\)`**. It measures the **share of variability in the response that is captured by the regression model**. `$$R^2=\frac{MSS}{TSS}=1-\frac{RSS}{TSS}$$` ```r glance(model) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.251 0.251 6.19 361. 1.12e-69 1 -3494. 6993. 7008. 41213. ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` <!-- - `\(\hat{\beta}_1\)` is significantly different from 0 → the father's height must have an effect on the son's height --> `\(R^2\)` = 0.25 → Only 25% of the variation in son height can be explained by father height alone. It can be concluded that there are other factors that influence the height of sons (possibly mother height, genetic markers, etc.). ] ] ??? TSS: - we know now that this line is optimal w.r.t. the RSS, so it is better in relation to other lines - useful to have a baseline - most basic model is the mean of the response - response is independent from predictor - regardless of how tall the father is we will always predict 174.5 RSS: - 25% of the variability of the sons' heights are captured by this line - for simple linear regression, R^2 is the square of the Pearson correlation coefficient - sum((model$fit$fitted.values - mean(father_son$sheight))^2)/sum((father_son$sheight - mean(father_son$sheight))^2) - F-ratio? p. 252 ## `augment()` ```r augment(model$fit) ``` ``` ## # A tibble: 1,078 x 8 ## sheight fheight .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 152. 165. 171. -19.2 0.00179 6.16 0.00860 -3.10 ## 2 161. 161. 169. -8.10 0.00335 6.19 0.00289 -1.31 ## 3 161. 165. 171. -10.0 0.00185 6.18 0.00242 -1.62 ## 4 159. 167. 172. -12.4 0.00139 6.18 0.00281 -2.01 ## 5 163. 155. 166. -2.63 0.00621 6.19 0.000568 -0.426 ## 6 163. 160. 168. -5.19 0.00361 6.19 0.00128 -0.840 ## 7 163. 166. 171. -8.66 0.00159 6.19 0.00156 -1.40 ## 8 163. 164. 171. -8.04 0.00201 6.19 0.00170 -1.30 ## 9 164. 168. 172. -8.22 0.00125 6.19 0.00111 -1.33 ## 10 163. 170. 174. -11.0 0.000991 6.18 0.00157 -1.78
## # ... with 1,068 more rows ``` - `.fitted`: predicted values - `.resid`: residuals - `.hat`: leverage values (degree of outlierness) - `.sigma`: residual standard deviation (impact on model fit) - `.cooksd`: Cook's distance (effect of deleting a given observation) - `.std.resid`: standardized residuals (.resid/standard deviation of residuals) --- exclude: true ## Statistical significance of fitted coefficients The `\(t\)`-**statistic** tests the null hypothesis that `\(\beta_1\)` is 0. The `\(t\)`-statistic is based on the comparison between the estimated `\(\beta_1\)` value and the amount of error in that estimate. `$$\begin{align}t&=\frac{\hat{\beta_1}-\hat{\beta_1}^{expected}}{SE(\hat{\beta_1})}\\&=\frac{\hat{\beta_1}}{SE(\hat{\beta_1})}\end{align}$$` ??? - to test the null hypothesis, we check whether our estimate for beta_1 is sufficiently different from 0 - numerator: beta1_hat - beta1_expected_hat - the beta-expected term is simply the value of beta that we would expect to obtain if the null hypothesis were true -> can be replaced by 0 - this depends on the accuracy of our estimate, our standard error - The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across cases. - the t-value is high if our estimate for beta_1 is high and the standard error is low - if t is very large then it is unlikely to have occurred when there is no effect - values of t have a special distribution that differs according to the degrees of freedom for the test. In regression, the degrees of freedom are n-p-1 -> simple linear regression: n-2 - a small standard error is an indication that beta_1 is non-zero - we compute the p-value from the t-distribution with n-2 degrees of freedom (n minus 2 coefficients) - reject the null hypothesis if the p-value is smaller than the significance level --- exclude: true ## Predictions Make predictions on an **out-of-sample** observation: ```r predict(model$fit, newdata = data.frame(fheight = 180)) ``` ``` ## 1 ## 178.6087 ``` using `broom`: ```r augment(model$fit, newdata = data.frame(fheight = 180)) ``` ``` ## # A tibble: 1 x 2 ## fheight .fitted ## <dbl> <dbl> ## 1 180 179. ``` ??? - output as a vector vs. as a data frame --- ## Multiple linear regression General model: $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$$ Predicted response equation: $$ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$$ ??? - the response can depend on several quantities -> e.g., the son's height depends on the height of the father and of the mother --- ## Ames Iowa Housing Dataset .left-column[ > "Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010." — [Dataset documentation](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) .font80[ De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education 19.3 (2011).
[URL](http://jse.amstat.org/v19n3/decock.pdf) ] ] .right-column[ ```r library(AmesHousing) ames <- make_ames() %>% # prepares dataset select(-matches("Qu")) # remove quality columns glimpse(ames) ``` ``` ## Rows: 2,930 ## Columns: 74 ## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_~ ## $ MS_Zoning <fct> Residential_Low_Density, Residential_High~ ## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, ~ ## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, ~ ## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave,~ ## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alle~ ## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irr~ ## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, L~ ## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, A~ ## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, I~ ## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G~ ## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North~ ## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm~ ## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm,~ ## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, O~ ## $ House_Style <fct> One_Story, One_Story, One_Story, One_Stor~ ## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Av~ ## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001,~ ## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001,~ ## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable~ ## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompS~ ## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, Vinyl~ ## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, Vinyl~ ## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace~ ## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, ~ ## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typic~ ## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PC~ ## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical,~ ## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, N~ ## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, G~ ## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3,~ ## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, U~ ## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ ## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017,~ ## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 12~ ## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA,~ ## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, ~ ## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,~ ## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,~ ## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 12~ ## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, ~ ## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, ~ ## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,~ ## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~ ## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,~ ## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,~ ## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2,~ ## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~ ## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5,~ ## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T~ ## $ Fireplaces <int> 2, 
0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,~ ## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, A~ ## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, R~ ## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~ ## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 6~ ## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typic~ ## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Pa~ ## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 14~ ## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84,~ ## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, ~ ## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~ ## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0~ ## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~ ## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Po~ ## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_F~ ## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None,~ ## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500,~ ## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2,~ ## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010,~ ## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , W~ ## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, N~ ## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 1~ ## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.6173~ ## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 4~ ``` ] ??? - 2930 observations, 74 variables - remove quality columns: why? --- ## Simple linear regression <img src="figures/_gen/06/ames-area-1.png" width="623.622047244095" /> The plot displays the relationship between ground floor living area and sale price of properties in Ames, Iowa. .content-box-yellow[ Is ground floor living area alone _sufficient_ to estimate a property's sale price? ] --- ## Categorical predictor with two levels .pull-left60[ ```r ames <- ames %>% mutate(has_fireplace = as.factor(Fireplaces > 0)) ames %>% select(Sale_Price, Fireplaces, has_fireplace) %>% print(n = 15) ``` ``` ## # A tibble: 2,930 x 3 ## Sale_Price Fireplaces has_fireplace ## <int> <int> <fct> ## 1 215000 2 TRUE ## 2 105000 0 FALSE ## 3 172000 0 FALSE ## 4 244000 2 TRUE ## 5 189900 1 TRUE ## 6 195500 1 TRUE ## 7 213500 0 FALSE ## 8 191500 0 FALSE ## 9 236500 1 TRUE ## 10 189000 1 TRUE ## 11 175900 1 TRUE ## 12 185000 0 FALSE ## 13 180400 1 TRUE ## 14 171500 1 TRUE ## 15 212000 0 FALSE ## # ... with 2,915 more rows ``` ] .pull-right40[ - `has_fireplace = FALSE`: property does not have a fireplace - `has_fireplace = TRUE`: property has one or more fireplaces ] --- .left-column[ .content-box-yellow[ How, if at all, does the relationship between ground floor living area and price of Ames properties vary by whether or not they have a fireplace? ] ] .right-column[ .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/ames-area-fireplace-1.png" width="623.622047244095" /> ] .panel[.panel-name[Code] ```r ggplot(data = ames, aes(x = Gr_Liv_Area, y = Sale_Price/1000, color = has_fireplace)) + geom_point(alpha = 0.4) + geom_smooth(method = "lm") + scale_color_brewer(palette = "Set1", direction = -1) + labs(x = "Ground floor living area [sq ft]", y = "Price [x 1,000 US$]", title = "Ground floor square feet vs. 
sale price, by fireplace") ``` ] ] ] --- <!-- Year_Sold --> ## Multiple linear regressions ```r linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area + has_fireplace, data = ames) %>% tidy() ``` ``` ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 17970. 3166. 5.68 1.52e- 8 ## 2 Gr_Liv_Area 97.0 2.22 43.7 5.60e-322 ## 3 has_fireplaceTRUE 33714. 2243. 15.0 3.03e- 49 ``` -- .content-box-blue[ #### Interpretation - Slope of `Gr_Liv_Area`: _all else held constant_, for each additional square feet a property's ground floor living area is larger, we would expect the sale price to be higher, on average, by ca. 97 US$. - Slope of `has_fireplaceTRUE`: _all else held constant_, a property with a fire place is, on average, by 33,714 US$ more expensive. - Intercept: a property with 0 sq ft ground floor living area and without fire places is expected to cost ca. 17,970 US$. ] --- ## Incorporating model complexity Recall the coefficient of determination: `\(R^2=\frac{MSS}{TSS}\)` .content-box-red[ This measure has an issue: `\(R^2\)` increases with each additional variable. <!-- * `\(MSS\)` always decreases for every additional variable in the model. --> <!-- * Consequently, --> ] -- ### Occam's razor or the principle of parsimony .content-box-blue[ Given two models of equal quality, the simpler model is preferred over the more complex model. ] → Add a variable to the model _only_ if its addition increases model quality (predictive power) by a _substantial degree_ --- ## Adjusted `\(R^2\)` The **Adjusted** `\(R^2\)` contains a term penalizing the addition of each further variable to the model: $$ `\begin{aligned} \text{Adjusted } R^2 &= R^2\cdot\frac{n-1}{n-p-1}\\ &=\frac{MSS}{TSS}\cdot\frac{n-1}{n-p-1} \end{aligned}` $$ - `\(n\)`: number of observations - `\(p\)`: number of predictors / explanatory variables ??? adjustiertes Bestimmtheitsmaß - zusätzlicher Term wird größer mit zunehmender Anzahl von Prädiktor-Variablen --- ## Adjusted `\(R^2\)` .panelset[ .panel[.panel-name[One vs. two predictors] ```r linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area, data = ames) %>% glance() ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.500 0.499 56524. 2923. 0 1 -36218. 72442. 72460. 9.35e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` ```r linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area + has_fireplace, data = ames) %>% glance() ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.535 0.535 54471. 1687. 0 2 -36109. 72226. 72250. 8.68e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` .content-box-green[ The addition of `has_fireplace` leads to a better model, both in terms of `\(R^2\)` and adjusted `\(R^2\)`. ] ] .panel[.panel-name[Counterexample plot] .content-box-yellow[ Does the addition of `Year_Sold` to the model lead to a better fit? <img src="figures/_gen/06/ames-area-year-sold-1.png" width="623.622047244095" /> ] ] .panel[.panel-name[Counterexample R² vs. adj. 
R²] ```r linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area, data = ames) %>% glance() ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.49954 0.49937 56524. 2922.6 0 1 -36218. 72442. 72460. 9.3549e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` ```r linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area + Year_Sold, data = ames) %>% glance() ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.49968 0.49934 56526. 1461.6 0 2 -36217. 72443. 72467. 9.3523e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` .content-box-red[ The addition of `Year_Sold` leads to a model with slightly better `\(R^2\)` (0.4997 > 0.4995). However, since the adjusted `\(R^2\)` value is lower (0.49934 < 0.49937), the simpler model should be preferred. Adjusted `\(R^2\)` doesn't increase because `Year_Sold` does not provide enough new information, or, in other words, is nearly unrelated. ] ] ] --- ## Main vs. interaction effects .content-box-blue[ Suppose we want to predict property prices from their living area and presence of a fire place. Which of the following models is more appropriate? 1. **Main effects:** The rate at which price changes as living area increases is the same for properties with fire places and properties without fire places. 2. **Interaction effects:** The rate at which price changes as living area increases differs between properties with fire places and properties without fire places. ] --- class: center, middle <img src="figures/_gen/06/ames-interaction-1.png" width="992.125984251968" /> <!-- attr_importance --> <!-- Gr_Liv_Area 64.7253324 --> <!-- Neighborhood 47.0390058 --> <!-- First_Flr_SF 36.8121825 --> <!-- Total_Bsmt_SF 34.7458999 --> <!-- MS_SubClass 31.9514404 --> <!-- Garage_Cars 28.1824909 --> <!-- Second_Flr_SF 27.5964854 --> <!-- Year_Built 26.6895673 --> <!-- BsmtFin_Type_1 26.5076070 --> <!-- Year_Remod_Add 24.8051642 --> <!-- Garage_Area 24.3035965 --> <!-- Fireplaces 24.0496287 --> <!-- Lot_Area 22.1650352 --> <!-- Bsmt_Unf_SF 19.6627118 --> <!-- Bsmt_Full_Bath 19.6453495 --> <!-- Central_Air 18.5536365 --> <!-- Exterior_2nd 18.2031016 --> <!-- Overall_Cond 17.7018098 --> <!-- TotRms_AbvGrd 17.0871701 --> <!-- Longitude 16.6975520 --> <!-- Latitude 16.2424198 --> <!-- Full_Bath 15.9053150 --> <!-- Bsmt_Exposure 15.6228796 --> <!-- Open_Porch_SF 15.4572853 --> <!-- Exterior_1st 15.2761932 --> <!-- BsmtFin_SF_1 14.5043185 --> <!-- Garage_Type 14.2260289 --> <!-- Garage_Cond 14.0896103 --> <!-- Wood_Deck_SF 12.9984279 --> <!-- Bedroom_AbvGr 12.7531997 --> <!-- Half_Bath 12.2157457 --> <!-- House_Style 12.1837985 --> <!-- Garage_Finish 12.1402089 --> <!-- Bldg_Type 11.8455279 --> <!-- MS_Zoning 11.3379275 --> <!-- Mas_Vnr_Area 11.0250429 --> <!-- Lot_Frontage 10.4645931 --> <!-- Bsmt_Cond 10.3258146 --> <!-- Paved_Drive 10.1635278 --> <!-- Kitchen_AbvGr 9.7998820 --> <!-- Mas_Vnr_Type 9.6310923 --> <!-- Foundation 9.0143406 --> <!-- Heating_QC 8.6007908 --> <!-- Sale_Condition 8.0757267 --> <!-- BsmtFin_Type_2 7.6664175 --> <!-- Roof_Style 7.4742070 --> <!-- Screen_Porch 6.5981197 --> <!-- Land_Contour 6.1430312 --> <!-- Functional 5.7319252 --> <!-- Sale_Type 5.1025988 --> <!-- Enclosed_Porch 4.7535688 --> <!-- Alley 4.5561920 --> <!-- 
Lot_Shape 4.5018087 --> <!-- Land_Slope 4.2719235 --> <!-- Exter_Cond 3.6997475 --> <!-- Mo_Sold 3.3331162 --> <!-- Condition_1 2.9150354 --> <!-- Lot_Config 2.2003938 --> <!-- Electrical 2.1345511 --> <!-- Misc_Val 1.7447942 --> <!-- Fence 1.7019599 --> <!-- Year_Sold 1.6404323 --> <!-- Misc_Feature 1.3703530 --> <!-- Pool_QC 1.2045949 --> <!-- BsmtFin_SF_2 1.1528664 --> <!-- Bsmt_Half_Bath 1.0722768 --> <!-- Street 0.9064887 --> <!-- Pool_Area 0.1314096 --> <!-- Utilities 0.0000000 --> <!-- Roof_Matl -0.5048499 --> <!-- Heating -1.0910015 --> <!-- Condition_2 -1.4516610 --> <!-- Three_season_porch -2.4842825 --> --- ## Model with main effects ```r m1 <- linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area + has_fireplace, data = ames) m1 ``` ``` ## parsnip model object ## ## Fit time: 0ms ## ## Call: ## stats::lm(formula = Sale_Price ~ Gr_Liv_Area + has_fireplace, ## data = data) ## ## Coefficients: ## (Intercept) Gr_Liv_Area has_fireplaceTRUE ## 17970 97 33714 ``` `$$\widehat{Sale\_Price} = 17,970 + 97 \cdot Gr\_Liv\_Area + 33,714 \cdot has\_fireplace$$` --- ## Model with interaction effects ```r (m2 <- linear_reg() %>% set_engine("lm") %>% fit(formula = Sale_Price ~ Gr_Liv_Area * has_fireplace, data = ames)) ``` ``` ## parsnip model object ## ## Fit time: 0ms ## ## Call: ## stats::lm(formula = Sale_Price ~ Gr_Liv_Area * has_fireplace, ## data = data) ## ## Coefficients: ## (Intercept) Gr_Liv_Area ## 57113.38 66.19 ## has_fireplaceTRUE Gr_Liv_Area:has_fireplaceTRUE ## -31309.47 45.90 ``` .font70[ $$ `\begin{align} \widehat{Sale\_Price} &= 57,113 &+ 66 \cdot Gr\_Liv\_Area &-31,309 \cdot has\_fireplace &+ 46 \cdot Gr\_Liv\_Area \cdot has\_fireplace\\\\ \widehat{Sale\_Price}(has\_fireplace = 1) &= 57,113 &+ 66 \cdot Gr\_Liv\_Area &-31,309 \cdot 1 &+ 46 \cdot Gr\_Liv\_Area \cdot 1\\ &= 25,804 &+ 112 \cdot Gr\_Liv\_Area & & \\\\ \widehat{Sale\_Price}(has\_fireplace = 0) &= 57,113 &+ 66 \cdot Gr\_Liv\_Area &-31,309 \cdot 0 &+ 46 \cdot Gr\_Liv\_Area \cdot 0\\ &= 57,113 &+ 66 \cdot Gr\_Liv\_Area & & \\ \end{align}` $$ ] --- ## Model selection ```r glance(m1) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.535 0.535 54471. 1687. 0 2 -36109. 72226. 72250. 8.68e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` ```r glance(m2) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.550 0.550 53594. 1194. 0 3 -36061. 72132. 72162. 8.40e12 ## # ... with 2 more variables: df.residual <int>, nobs <int> ``` .content-box-green[ Including the interaction effect of `Gr_Liv_Area` and `has_fireplace` yields a better model. ] ```r glance(m2)$adj.r.squared > glance(m1)$adj.r.squared ``` ``` ## [1] TRUE ``` --- ## Models with categorical predictors .panelset[ .panel[.panel-name[Dwelling type] ```r janitor::tabyl(ames, Bldg_Type) ``` ``` ## Bldg_Type n percent ## OneFam 2425 0.82764505 ## TwoFmCon 62 0.02116041 ## Duplex 109 0.03720137 ## Twnhs 101 0.03447099 ## TwnhsE 233 0.07952218 ``` The explanatory variable `Bldg_Type` is a factor with five levels. .content-box-blue[ Multinominal predictors, i.e., variables with more than two categories, will be automatically encoded as **dummy variables**. 
] ] .panel[.panel-name[Dummy variables] <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Dummy variables</div></th> </tr> <tr> <th style="text-align:left;"> Bldg_Type </th> <th style="text-align:center;"> TwoFmCon </th> <th style="text-align:center;"> Duplex </th> <th style="text-align:center;"> Twnhs </th> <th style="text-align:center;"> TwnhsE </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> OneFam </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> </tr> <tr> <td style="text-align:left;"> TwoFmCon </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: lime !important;">1</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> </tr> <tr> <td style="text-align:left;"> Duplex </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: lime !important;">1</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> </tr> <tr> <td style="text-align:left;"> Twnhs </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: lime !important;">1</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> </tr> <tr> <td style="text-align:left;"> TwnhsE </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee !important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: #eeeeee 
!important;">0</span> </td> <td style="text-align:center;"> <span style=" padding-right: 4px; padding-left: 4px; background-color: lime !important;">1</span> </td> </tr> </tbody> </table> Note that only four dummy variables are created. The building type of properties with _baseline category_ `OneFam` can be inferred from the other dummy variables. ] .panel[.panel-name[Relationsh. between living area, dwelling type & sale price] ```r linear_reg() %>% set_engine("lm") %>% fit(Sale_Price ~ Gr_Liv_Area + Bldg_Type, data = ames) %>% tidy() ``` ``` ## # A tibble: 6 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 11935. 3223. 3.70 2.17e- 4 ## 2 Gr_Liv_Area 114. 2.00 57.0 0. ## 3 Bldg_TypeTwoFmCon -59300. 6951. -8.53 2.29e-17 ## 4 Bldg_TypeDuplex -60746. 5299. -11.5 8.58e-30 ## 5 Bldg_TypeTwnhs -18059. 5515. -3.27 1.07e- 3 ## 6 Bldg_TypeTwnhsE 27477. 3723. 7.38 2.06e-13 ``` ] .panel[.panel-name[Plot] <img src="figures/_gen/06/ames-area-building-type-plot-1.png" width="708.661417322835" /> ] ] --- ## Further observation-level diagnostics .pull-left80[ .content-box-yellow[ - For which properties does the model have difficulty estimating the sale price? - Which properties are **outliers** with respect to the sample distribution? - Which properties have a high **influence** on the model? ] ] .pull-right20[ <img src="figures/_gen/06/outlier-illustration-1.png" width="283.464566929134" /> ] ```r (augmented_mod <- augment(model$fit)) ``` ``` ## # A tibble: 1,078 x 8 ## sheight fheight .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 152. 165. 171. -19.2 0.00179 6.16 0.00860 -3.10 ## 2 161. 161. 169. -8.10 0.00335 6.19 0.00289 -1.31 ## 3 161. 165. 171. -10.0 0.00185 6.18 0.00242 -1.62 ## 4 159. 167. 172. -12.4 0.00139 6.18 0.00281 -2.01 ## 5 163. 155. 166. -2.63 0.00621 6.19 0.000568 -0.426 ## 6 163. 160. 168. -5.19 0.00361 6.19 0.00128 -0.840 ## 7 163. 166. 171. -8.66 0.00159 6.19 0.00156 -1.40 ## 8 163. 164. 171. -8.04 0.00201 6.19 0.00170 -1.30 ## 9 164. 168. 172. -8.22 0.00125 6.19 0.00111 -1.33 ## 10 163. 170. 174. -11.0 0.000991 6.18 0.00157 -1.78 ## # ... with 1,068 more rows ``` --- ## Properties with high residuals Which son heights are difficult to predict? .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/residual-plot-1.png" width="720" /> .font80[ .content-box-blue[ Residuals are measured in the same units as the response. To obtain **standard residuals**, we divide the residuals by their standard deviation. ] ] ] .panel[.panel-name[Code] ```r ggplot(augmented_mod, aes(x = fheight, y = sheight)) + geom_point(aes(color = .resid)) + geom_smooth(method = "lm") + scale_color_gradient2(mid = "gray80") + labs(x = "Father's height [cm]", y = "Son's height [cm]", color = "Residuals") ``` ] ] ??? - Punkte sind sehr weit vom Regressionsgerade entfernt - in einem interaktiven Tool könnte der Experte entscheiden, was mit diesen Beobachtungen gemacht wird. --- ## High leverage properties <a href="https://en.wikipedia.org/wiki/Leverage_(statistics)">Leverage</a> (or **hat values**) measures the outlierness of an observation based on the difference to the general distribution of the sample. .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/leverage-1.png" width="720" /> .font80[ .content-box-blue[ The range of hat values is between 0 (very close to other observations) and 1 (far away from other observations). 
] ] ] .panel[.panel-name[Code] ```r ggplot(augmented_mod, aes(fheight, sheight)) + * geom_point(aes(color = .hat), alpha = 0.6) + geom_smooth(method = "lm") + scale_color_viridis_c(option = "inferno") + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` ] ] ??? - observations with high leverage could influence the regression model. - a popular measure to calculate the effect of an observation on a regression model is Cook's distance --- ## "Influential" observations **Cook's distance** measures how strongly the fitted model changes when a given observation is removed. <!-- `$$D_i=\frac{e_i^2}{RSS/N}\left( \frac{h_i}{(1-h_i)^2} \right)$$` --> .panelset[ .panel[.panel-name[Plot] <img src="figures/_gen/06/cook-1.png" width="720" /> .font80[ .content-box-blue[ As a rule of thumb, Cook's distance values greater than 1 indicate observations with a high influence on the model. In general, for observations with a high Cook's distance, one must weigh whether (a) they represent natural variation and should therefore be retained, or (b) they are outliers (e.g., measurement errors) that are better removed before fitting the model. ] ] ] .panel[.panel-name[Code] ```r ggplot(augmented_mod, aes(fheight, sheight)) + * geom_point(aes(color = .cooksd), alpha = 0.6) + geom_smooth(method = "lm") + scale_color_viridis_c(option = "inferno") + labs(x = "Father's height [cm]", y = "Son's height [cm]") ``` ] ] ??? - question: what to do with these points? - do they reflect natural variation across observations? (keep) - or could it be that they are noise or measurement errors (delete) --- ## Assumptions of linear regression The list of assumptions for linear regression models includes: - **linearity**, i.e., a "straight-line" relationship between predictors and response - **homoscedasticity**, i.e., constant variance of the residual terms - no perfect **collinearity** (perfect linear relationship between two or more predictors) - predictors are uncorrelated with variables that are not included in the model - absence of influential **outliers** and **high-leverage** points ??? **TODO** modify from here - external variables: variables that haven't been included in the regression model which influence the outcome variable -> if external variables do correlate with the predictors, then the conclusions we draw from the model become unreliable (because other variables exist that can predict the outcome just as well) - --- ## Considerations for high-dimensional datasets 1/2 Two popular methods for datasets with a high number of predictor variables are **Ridge** regression and **LASSO** regression. They introduce a **shrinkage** term in their optimization functions. The standard least squares estimate minimizes: `$$RSS= \sum_{i=1}^n \left( y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}\right)\right)^2$$` -- **Ridge regression**: penalizes large `\(\hat{\beta}\)` values (L2 penalty) `$$\sum_{i=1}^n \left( y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}\right)\right)^2+\lambda\sum_{j=1}^p \beta_j^2=RSS+\lambda\sum_{j=1}^p \beta_j^2$$` → implemented in `glmnet::glmnet()`: set `alpha=0` Advantage: **more stable than least squares** when the number of predictors `\(p\)` is almost as large as the number of observations `\(n\)` ??? - penalty term that shrinks coefficients toward 0 - the higher lambda -> the more impact the penalty term has - lambda is a tuning parameter -> must be optimized, e.g.
by cross-validation - lse: problem: if low bias but high variance -> a small change in the training data can cause a large change in the least squares coefficients - don't say: bias-variance trade-off: the higher lambda, the higher the bias and the lower the variance - ridge regression does have one obvious disadvantage: it shrinks all coefficients towards 0, but it will not set any of them exactly to zero - causes problems for interpretation --- ## Considerations for high-dimensional datasets 2/2 **Ridge regression**: penalizes large `\(\hat{\beta}\)` values (L2 penalty) `$$\sum_{i=1}^n \left( y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}\right)\right)^2+\lambda\sum_{j=1}^p \beta_j^2=RSS+\lambda\sum_{j=1}^p \beta_j^2$$` **LASSO regression**: penalizes large absolute `\(\hat{\beta}\)` values (L1 penalty) `$$\sum_{i=1}^n \left( y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}\right)\right)^2+\lambda\sum_{j=1}^p |\beta_j|=RSS+\lambda\sum_{j=1}^p |\beta_j|$$` → implemented in `glmnet::glmnet()`: set `alpha=1` (a short usage sketch follows later in the deck) Advantage: forces some coefficients to be **exactly equal to 0**, which makes the model easier to interpret ??? - l1 penalty instead of l2 penalty --- exclude: true ## Other regression models in `R` - **Generalized linear models** (`stats::glm()`): allow for discrete responses, e.g. logistic regression - **Generalized additive models** (`mgcv::gam()`): allow for non-linear transformation of predictors - **Robust linear models** (`MASS::rlm()`): distance function reduces sensitivity to outliers ??? - Generalised linear models, e.g. stats::glm(). Linear models assume that the response is continuous and the error has a normal distribution. Generalised linear models extend linear models to include non-continuous responses (e.g. binary target variables). They work by defining a distance metric based on the statistical idea of likelihood. - Generalised additive models, e.g. mgcv::gam(), extend generalised linear models so that non-linear relationships can also be modeled. - RLMs use a different notion of distance between observations: - points that are far away from the mean receive a lower weight -> downweighted - more robust when outliers are present - (at the cost of being not quite as good when there are no outliers) --- exclude: true ## Summary - regression is a method to **predict values of one variable (response) from one or more other variables (predictors)** - in **simple linear regression**, a statistical model is fitted to the data in the form of a **straight line** - this **line minimizes the sum of squared differences between the observed and the predicted values of the response** - the **coefficient of determination `\(R^2\)`** indicates **how much variance is explained by the model** - the `\(\beta_1\)` value indicates the slope of the regression line and thus the **direction** and **strength** of the relationship between a predictor and the response - pros: - often (surprisingly) competitive in comparison with more sophisticated methods - model itself is **interpretable** → useful for inference - cons: - strong assumptions, e.g., absence of collinearity - in terms of accuracy inferior to state-of-the-art algorithms ??? - addition R^2: it is the proportion of variance in the response that is explained by the predictor variables - If beta_1 is significant, then the predictor significantly predicts the outcome variable.
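---

## Ridge and LASSO with `glmnet` (usage sketch)

A minimal sketch of how the two penalties are fit in practice. It assumes `ames` is the Ames data prepared above; `cv.glmnet()` chooses `\(\lambda\)` by 10-fold cross-validation (the code is illustrative and not evaluated in these slides).

```r
library(glmnet)

# glmnet expects a numeric predictor matrix, so categorical predictors
# are dummy-encoded first; [, -1] drops the intercept column
x <- model.matrix(Sale_Price ~ ., data = ames)[, -1]
y <- ames$Sale_Price

cv_ridge <- cv.glmnet(x, y, alpha = 0)  # ridge: L2 penalty
cv_lasso <- cv.glmnet(x, y, alpha = 1)  # LASSO: L1 penalty

cv_lasso$lambda.min              # lambda that minimizes the cross-validated error
coef(cv_lasso, s = "lambda.min") # LASSO typically sets some coefficients exactly to 0
```

Internally, both calls refit the model over a grid of `\(\lambda\)` values; the ridge fit keeps all coefficients non-zero, while the LASSO fit yields a sparse coefficient vector.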
--- ## Session info .font70[ ``` ## setting value ## version R version 4.0.4 (2021-02-15) ## os Windows 10 x64 ## system x86_64, mingw32 ## ui RTerm ## language EN ## collate English_United States.1252 ## ctype English_United States.1252 ## tz Europe/Berlin ## date 2021-05-06 ``` ] <div style="font-size:80%;"> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> airports </td> <td style="text-align:left;"> 0.1.0 </td> <td style="text-align:left;"> 2020-06-29 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> AmesHousing </td> <td style="text-align:left;"> 0.0.4 </td> <td style="text-align:left;"> 2020-06-23 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> broom </td> <td style="text-align:left;"> 0.7.6 </td> <td style="text-align:left;"> 2021-04-05 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> cherryblossom </td> <td style="text-align:left;"> 0.1.0 </td> <td style="text-align:left;"> 2020-06-25 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> dials </td> <td style="text-align:left;"> 0.0.9 </td> <td style="text-align:left;"> 2020-09-16 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> dplyr </td> <td style="text-align:left;"> 1.0.5 </td> <td style="text-align:left;"> 2021-03-05 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> forcats </td> <td style="text-align:left;"> 0.5.1 </td> <td style="text-align:left;"> 2021-01-27 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> ggplot2 </td> <td style="text-align:left;"> 3.3.3 </td> <td style="text-align:left;"> 2020-12-30 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> infer </td> <td style="text-align:left;"> 0.5.4 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> kableExtra </td> <td style="text-align:left;"> 1.3.4 </td> <td style="text-align:left;"> 2021-02-20 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> modeldata </td> <td style="text-align:left;"> 0.1.0 </td> <td style="text-align:left;"> 2020-10-22 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> openintro </td> <td style="text-align:left;"> 2.0.0 </td> <td style="text-align:left;"> 2020-07-03 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> parsnip </td> <td style="text-align:left;"> 0.1.5 </td> <td style="text-align:left;"> 2021-01-19 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> patchwork </td> <td style="text-align:left;"> 1.1.1 </td> <td style="text-align:left;"> 2020-12-17 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> </tbody> </table> ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:left;"> purrr </td> <td style="text-align:left;"> 0.3.4 </td> <td style="text-align:left;"> 2020-04-17 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2020-10-05 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> recipes </td> <td style="text-align:left;"> 0.1.15 </td> <td style="text-align:left;"> 2020-11-11 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> rsample </td> <td style="text-align:left;"> 0.0.9 </td> <td style="text-align:left;"> 2021-02-17 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> scales </td> <td style="text-align:left;"> 1.1.1 </td> <td style="text-align:left;"> 2020-05-11 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> stringr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2019-02-10 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tibble </td> <td style="text-align:left;"> 3.1.0 </td> <td style="text-align:left;"> 2021-02-25 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> tidymodels </td> <td style="text-align:left;"> 0.1.2 </td> <td style="text-align:left;"> 2020-11-22 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> tidyr </td> <td style="text-align:left;"> 1.1.3 </td> <td style="text-align:left;"> 2021-03-03 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> tidyverse </td> <td style="text-align:left;"> 1.3.0 </td> <td style="text-align:left;"> 2019-11-21 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tune </td> <td style="text-align:left;"> 0.1.2 </td> <td style="text-align:left;"> 2020-11-17 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> usdata </td> <td style="text-align:left;"> 0.1.0 </td> <td style="text-align:left;"> 2020-06-30 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> workflows </td> <td style="text-align:left;"> 0.2.2 </td> <td style="text-align:left;"> 2021-03-10 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> yardstick </td> <td style="text-align:left;"> 0.0.7 </td> <td style="text-align:left;"> 2020-07-13 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> </tbody> </table> ] </div> --- class: last-slide, center, bottom # Thank you! Questions? .courtesy[📷 Photo courtesy of Stefan Berger]