class: title-slide, center, bottom # 07 - Machine Learning with tidymodels ## Data Science with R · Summer 2021 ### Uli Niemann · Knowledge Management & Discovery Lab #### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/) .courtesy[📷 Photo courtesy of Ulrich Arendt] --- ## tidymodels <img src="figures//07-tidymodels-workflow.png" width="100%" /> <!-- ## tidymodels ecosystem --> ??? tidymodels is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse. R is free, open source and provides a high flexibility in terms of how things ca be implemented -> large community of developers with different backgrounds and different design philosophies -> inconsistent syntax API of modeling packages. It provides a unified interface to various predictive modeling packages with a consistent syntax. similarly to tidyverse, multiple small packages for dedicated subtasks instead of one single huge package Today, we will cover the packages - parsnip: general API to modeling and analysis functions - rsample: resampling data: holdout validation, cross-validation, bootstrap validation - yardstick: model evaluation metrics (accuracy, RMSE) - tune: hyperparameter optimization - workflows: combine pre-processing steps and models into single objects - recipes: data preprocessing: feature engineering, imputation, etc - dials? has tools to create and manage values of tuning parameters. --- class: bottom, center background-image: url("figures/07-caret-obs.png") background-size: contain ??? tidymodels is the official successor of caret, also from the same author, Max Kuhn. --- background-image: url("https://raw.githubusercontent.com/mlr-org/mlr3/master/man/figures/mlr3verse.svg?sanitize=true") background-size: contain .footnote[Figure source: <https://mlr3.mlr-org.com/>] --- class: middle This tutorial is a condensed version of the 2-day workshop ["Introduction to Machine Learning with the Tidyverse"](https://conf20-intro-ml.netlify.app/) held by Dr. Alison Hill at the [rstudio::conf 2020](https://rstudio.com/conference/). <iframe src="https://conf20-intro-ml.netlify.app/" width="100%" height="450px"></iframe> --- ## Setup ```r library(tidyverse) library(tidymodels) ``` ``` ## -- Attaching packages ------------------------------------------------ tidymodels 0.1.2 -- ``` ``` ## v broom 0.7.6 v recipes 0.1.15 ## v dials 0.0.9 v rsample 0.0.9 ## v infer 0.5.4 v tune 0.1.2 ## v modeldata 0.1.0 v workflows 0.2.2 ## v parsnip 0.1.5 v yardstick 0.0.7 ``` ``` ## -- Conflicts --------------------------------------------------- tidymodels_conflicts() -- ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x kableExtra::group_rows() masks dplyr::group_rows() ## x dplyr::lag() masks stats::lag() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ``` --- ## Ames Iowa Housing Dataset .left-column[ > "Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010." — [Dataset documentation](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) .font80[ De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education 19.3 (2011). 
[URL](http://jse.amstat.org/v19n3/decock.pdf) ] ] .right-column[ ```r library(AmesHousing) (ames <- make_ames() %>% select(-matches("Qu"))) ``` ``` ## # A tibble: 2,930 x 74 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape ## <fct> <fct> <dbl> <int> <fct> <fct> <fct> ## 1 One_Story_1~ Resident~ 141 31770 Pave No_A~ Slightly~ ## 2 One_Story_1~ Resident~ 80 11622 Pave No_A~ Regular ## 3 One_Story_1~ Resident~ 81 14267 Pave No_A~ Slightly~ ## 4 One_Story_1~ Resident~ 93 11160 Pave No_A~ Regular ## 5 Two_Story_1~ Resident~ 74 13830 Pave No_A~ Slightly~ ## 6 Two_Story_1~ Resident~ 78 9978 Pave No_A~ Slightly~ ## 7 One_Story_P~ Resident~ 41 4920 Pave No_A~ Regular ## 8 One_Story_P~ Resident~ 43 5005 Pave No_A~ Slightly~ ## 9 One_Story_P~ Resident~ 39 5389 Pave No_A~ Slightly~ ## 10 Two_Story_1~ Resident~ 60 7500 Pave No_A~ Regular ## # ... with 2,920 more rows, and 67 more variables: ## # Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>, ## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, ## # Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>, ## # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>, ## # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, ## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, ## # Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, ## # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, ## # BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ## # Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, ## # Central_Air <fct>, Electrical <fct>, First_Flr_SF <int>, ## # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, ## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>, ## # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>, ## # Functional <fct>, Fireplaces <int>, Garage_Type <fct>, ## # Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>, ## # Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <int>, ## # Open_Porch_SF <int>, Enclosed_Porch <int>, ## # Three_season_porch <int>, Screen_Porch <int>, Pool_Area <int>, ## # Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, Misc_Val <int>, ## # Mo_Sold <int>, Year_Sold <int>, Sale_Type <fct>, ## # Sale_Condition <fct>, Sale_Price <int>, Longitude <dbl>, ## # Latitude <dbl> ``` ] ??? - 2930 observations, 74 variables - remove quality columns: why? --- class: center, inverse, middle name: parsnip .pull-left70[ # Specify a model with parsnip ] .pull-right30[ <img src="figures//07-parsnip.png" width="100%" /> ] --- class: middle ## Specify a model with `parsnip` .content-box-blue[ .font130[ 1. Pick a **model** 2. Set the **engine** 3. Set the **mode** (if needed) ] ] -- .pull-left[ ```r decision_tree() %>% # model set_engine("rpart") %>% # engine set_mode("classification") # mode ``` ``` ## Decision Tree Model Specification (classification) ## ## Computational engine: rpart ``` ] -- .pull-right[ ```r nearest_neighbor() %>% set_engine("kknn") %>% set_mode("regression") ``` ``` ## K-Nearest Neighbor Model Specification (regression) ## ## Computational engine: kknn ``` ] --- class: middle All available models are listed at <https://www.tidymodels.org/find/parsnip/#models>. <iframe src="https://www.tidymodels.org/find/parsnip/#models" width="100%" height="450px"></iframe> --- class: middle .left-column[ .content-box-blue[ .font130[ 1\. Pick a **model** .fade[ 2\. Set the **engine** 3\. 
Set the **mode** ] ] ] ] .right-column[ ## `linear_reg()` Specify a model that uses linear regression: ```r linear_reg( mode = "regression", # type of model (only "regression" here) penalty = NULL, # amount of regularization mixture = NULL # proportion of L1 regularization ) ``` ] ??? linear_reg() is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R, Stan, keras, or via Spark. The main arguments for the model are: penalty (lambda for glmnet): The total amount of regularization in the model. in other words: the degree of shrinking the model coefficients towards 0 mixture (alpha for glmnet): The proportion of L1 regularization in the model. One of the extreme cases "Lasso" or "ridge", or a combination of the two. --- class: middle .left-column[ .content-box-blue[ .font130[ .fade[1\. Pick a **model**] 2\. Set the **engine** .fade[3\. Set the **mode**] ] ] ] .right-column[ ## `set_engine()` Add an engine to power or implement the model: ```r linear_reg() %>% * set_engine(engine = "lm", ...) ``` Available engines for `linear_reg()`: - R: "lm" (the default) or "glmnet" - Stan: "stan" - Spark: "spark" - keras: "keras" ] --- class: middle .left-column[ .content-box-blue[ .font130[ .fade[1\. Pick a **model** 2\. Set the **engine**] 3\. Set the **mode** ] ] ] .right-column[ ## `set_mode()` Set the model type, either `"regression"` or `"classification"`. Not necessary if mode is set in Step 1. ```r linear_reg() %>% set_engine(engine = "lm") %>% * set_mode(mode = "regression") ``` ] --- ## `fit()` `fit()`: fit a simple linear regression model to predict _sale price_ based on _above ground living area_. .pull-left[ ```r lm_spec <- linear_reg() %>% set_engine(engine = "lm") %>% set_mode(mode = "regression") *m <- fit( * lm_spec, # parsnip model spec * Sale_Price ~ Gr_Liv_Area, # formula * ames # data frame *) m ``` ``` ## parsnip model object ## ## Fit time: 10ms ## ## Call: ## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data) ## ## Coefficients: ## (Intercept) Gr_Liv_Area ## 13289.6 111.7 ``` ] .pull-right[ <img src="figures/_gen/07/linear-reg-5-1.png" width="425.196850393701" /> ] ??? - until now, we have only _specified_ the model, but we haven't run it. - fit(): fit a model using the parsnip model spec, a formula (lhs: target attribute, rhs: predictors) and the training data --- ## `predict()` `predict()`: use a fitted model to predict new response values from data. Returns a tibble. .pull-left[ ```r p <- predict(m, new_data = ames) p ``` ``` ## # A tibble: 2,930 x 1 ## .pred ## <dbl> ## 1 198255. ## 2 113367. ## 3 161731. ## 4 248964. ## 5 195239. ## 6 192447. ## 7 162736. ## 8 156258. ## 9 193787. ## 10 214786. ## # ... with 2,920 more rows ``` ] .pull-right[ <img src="figures/_gen/07/linear-reg-7-1.png" width="425.196850393701" /> ] ??? - residuals: difference between observed and predicted values --- class: center, inverse, middle name: yardstick .pull-left70[ # Measure model performance with yardstick ] .pull-right30[ <img src="figures//07-yardstick.png" width="100%" /> ] --- class: middle ## Measure the model performance with `yardstick::rmse()` - **Residuals**. The difference between observed and predicted values: `\(\hat{y}_i-y_i\)`. - **Mean Absolute Error**. `\(\frac{1}{n}\sum_{i=1}^n|\hat{y}_i-y_i|\)`. - **Root Mean Squared Error**. `\(\sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{y}_i-y_i)^2}\)`. 
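A minimal hand-rolled sketch of MAE and RMSE (the `obs`/`pred` values below are placeholders for illustration); the yardstick helpers that follow compute the same quantities from a `truth` and an `estimate` column:

```r
# Hand-rolled MAE and RMSE; yardstick's mae()/rmse() return the same values
errors <- tibble(obs  = c(200000, 150000, 300000),
                 pred = c(210000, 140000, 320000))
errors %>%
  summarise(mae  = mean(abs(pred - obs)),
            rmse = sqrt(mean((pred - obs)^2)))
```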
-- Calculate the RMSE based on two columns in a data frame: - truth `\(y_i\)` - predicted estimate `\(\hat{y}\)` ```r lm_spec <- linear_reg() %>% set_engine(engine = "lm") %>% set_mode(mode = "regression") lm_fit <- fit(object = lm_spec, formula = Sale_Price ~ Gr_Liv_Area, data = ames) price_pred <- lm_fit %>% predict(new_data = ames) %>% mutate(truth = ames$Sale_Price) *rmse(price_pred, truth = truth, estimate = .pred) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 56505. ``` --- ## Available metrics in yardstick <https://yardstick.tidymodels.org/articles/metric-types.html#metrics> <iframe src="https://yardstick.tidymodels.org/articles/metric-types.html#metrics" width="100%" height="450px"></iframe> --- class: center, inverse, middle name: rsample .pull-left70[ # Perform resampling with rsample ] .pull-right30[ <img src="figures//07-rsample.png" width="100%" /> ] ??? - so far, we have evaluated model performance on training data which gives us too optimistic estimates of the true model performance - we need to evaluate the model on a test dataset that is independent from the dataset used for model training --- class: middle ## `initial_split()` `initial_split()`: partition data randomly into a single training and a single test set. ```r set.seed(123) (ames_split <- initial_split(ames, prop = 3/4)) # prop = proportion of training instances ``` ``` ## <Analysis/Assess/Total> ## <2198/732/2930> ``` --- ## `training()` and `testing()` Extract training and testing sets from an `rsplit` object: .pull-left[ ```r training(ames_split) ``` ``` ## # A tibble: 2,198 x 74 ## MS_SubClass MS_Zoning Lot_Frontage ## <fct> <fct> <dbl> ## 1 One_Story_1946_~ Residential~ 141 ## 2 One_Story_1946_~ Residential~ 80 ## 3 One_Story_1946_~ Residential~ 81 ## 4 One_Story_1946_~ Residential~ 93 ## 5 Two_Story_1946_~ Residential~ 74 ## 6 Two_Story_1946_~ Residential~ 78 ## 7 One_Story_PUD_1~ Residential~ 41 ## 8 Two_Story_1946_~ Residential~ 75 ## 9 One_Story_1946_~ Residential~ 0 ## 10 One_Story_1946_~ Residential~ 85 ## # ... 
with 2,188 more rows, and 71 more ## # variables: Lot_Area <int>, Street <fct>, ## # Alley <fct>, Lot_Shape <fct>, ## # Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, ## # Neighborhood <fct>, Condition_1 <fct>, ## # Condition_2 <fct>, Bldg_Type <fct>, ## # House_Style <fct>, Overall_Cond <fct>, ## # Year_Built <int>, Year_Remod_Add <int>, ## # Roof_Style <fct>, Roof_Matl <fct>, ## # Exterior_1st <fct>, Exterior_2nd <fct>, ## # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, ## # Exter_Cond <fct>, Foundation <fct>, ## # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, ## # BsmtFin_Type_1 <fct>, ## # BsmtFin_SF_1 <dbl>, ## # BsmtFin_Type_2 <fct>, ## # BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ## # Total_Bsmt_SF <dbl>, Heating <fct>, ## # Heating_QC <fct>, Central_Air <fct>, ## # Electrical <fct>, First_Flr_SF <int>, ## # Second_Flr_SF <int>, Gr_Liv_Area <int>, ## # Bsmt_Full_Bath <dbl>, ## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, ## # Half_Bath <int>, Bedroom_AbvGr <int>, ## # Kitchen_AbvGr <int>, ## # TotRms_AbvGrd <int>, Functional <fct>, ## # Fireplaces <int>, Garage_Type <fct>, ## # Garage_Finish <fct>, Garage_Cars <dbl>, ## # Garage_Area <dbl>, Garage_Cond <fct>, ## # Paved_Drive <fct>, Wood_Deck_SF <int>, ## # Open_Porch_SF <int>, ## # Enclosed_Porch <int>, ## # Three_season_porch <int>, ## # Screen_Porch <int>, Pool_Area <int>, ## # Pool_QC <fct>, Fence <fct>, ## # Misc_Feature <fct>, Misc_Val <int>, ## # Mo_Sold <int>, Year_Sold <int>, ## # Sale_Type <fct>, Sale_Condition <fct>, ## # Sale_Price <int>, Longitude <dbl>, ## # Latitude <dbl> ``` ] .pull-right[ ```r testing(ames_split) ``` ``` ## # A tibble: 732 x 74 ## MS_SubClass MS_Zoning Lot_Frontage ## <fct> <fct> <dbl> ## 1 One_Story_PUD_1~ Residential~ 43 ## 2 One_Story_PUD_1~ Residential~ 39 ## 3 Two_Story_1946_~ Residential~ 60 ## 4 Two_Story_1946_~ Residential~ 63 ## 5 Two_Story_1946_~ Residential~ 47 ## 6 One_Story_1946_~ Residential~ 88 ## 7 One_Story_1946_~ Residential~ 0 ## 8 Two_Story_PUD_1~ Residential~ 21 ## 9 One_Story_1946_~ Residential~ 95 ## 10 One_Story_1946_~ Residential~ 70 ## # ... 
with 722 more rows, and 71 more ## # variables: Lot_Area <int>, Street <fct>, ## # Alley <fct>, Lot_Shape <fct>, ## # Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, ## # Neighborhood <fct>, Condition_1 <fct>, ## # Condition_2 <fct>, Bldg_Type <fct>, ## # House_Style <fct>, Overall_Cond <fct>, ## # Year_Built <int>, Year_Remod_Add <int>, ## # Roof_Style <fct>, Roof_Matl <fct>, ## # Exterior_1st <fct>, Exterior_2nd <fct>, ## # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, ## # Exter_Cond <fct>, Foundation <fct>, ## # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, ## # BsmtFin_Type_1 <fct>, ## # BsmtFin_SF_1 <dbl>, ## # BsmtFin_Type_2 <fct>, ## # BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ## # Total_Bsmt_SF <dbl>, Heating <fct>, ## # Heating_QC <fct>, Central_Air <fct>, ## # Electrical <fct>, First_Flr_SF <int>, ## # Second_Flr_SF <int>, Gr_Liv_Area <int>, ## # Bsmt_Full_Bath <dbl>, ## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, ## # Half_Bath <int>, Bedroom_AbvGr <int>, ## # Kitchen_AbvGr <int>, ## # TotRms_AbvGrd <int>, Functional <fct>, ## # Fireplaces <int>, Garage_Type <fct>, ## # Garage_Finish <fct>, Garage_Cars <dbl>, ## # Garage_Area <dbl>, Garage_Cond <fct>, ## # Paved_Drive <fct>, Wood_Deck_SF <int>, ## # Open_Porch_SF <int>, ## # Enclosed_Porch <int>, ## # Three_season_porch <int>, ## # Screen_Porch <int>, Pool_Area <int>, ## # Pool_QC <fct>, Fence <fct>, ## # Misc_Feature <fct>, Misc_Val <int>, ## # Mo_Sold <int>, Year_Sold <int>, ## # Sale_Type <fct>, Sale_Condition <fct>, ## # Sale_Price <int>, Longitude <dbl>, ## # Latitude <dbl> ``` ] --- ## Stratified sampling ```r *initial_split(ames, strata = Sale_Price, breaks = 6) ``` <img src="figures/_gen/07/strat-sampling-1-1.png" width="708.661417322835" /> ??? - apply equal-frequency binning on the target variable and draw train/test instances with the specified split percentages from each bin - to ensure that we have (approx.) the same ratio of train/test instances in each bin General drawback of holdout method: - If training set is small, model fit may be poor - If testing set is small, performance values have high variance -> resampling --- ## Cross-validation with `vfold_cv()` General syntax: ```r vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, ...) ``` -- .pull-left60[ Example: 10-fold CV on ames data: ```r set.seed(123) (folds <- vfold_cv(ames, v = 5)) ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 x 2 ## splits id ## <list> <chr> ## 1 <split [2344/586]> Fold1 ## 2 <split [2344/586]> Fold2 ## 3 <split [2344/586]> Fold3 ## 4 <split [2344/586]> Fold4 ## 5 <split [2344/586]> Fold5 ``` Check whether mean `\(y\)` is approx. equal in each training fold: ```r map_dbl(folds$splits, ~mean(.x$data$Sale_Price[.x$in_id])) ``` ``` ## [1] 181310.8 180991.0 180840.0 181268.6 ## [5] 179569.9 ``` ] .pull-right40[ <img src="figures/_gen/07/cv-4-1.png" width="425.196850393701" /> ] ??? 
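- a cleaner way to inspect a single fold is via rsample's `analysis()` / `assessment()` accessors; a small sketch, assuming the 5-fold `folds` object from this slide:

```r
first_fold <- folds$splits[[1]]
dim(analysis(first_fold))              # rows used for fitting in this fold (~4/5)
dim(assessment(first_fold))            # held-out rows (~1/5)
mean(analysis(first_fold)$Sale_Price)  # same check as map_dbl() above, for one fold
```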
- `vfold_cv()` also has a strata argument --- ## Calculate the model performance on multiple resamples with `fit_resamples()` ```r res <- fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area, resamples = folds) res ``` ``` ## # Resampling results ## # 5-fold cross-validation ## # A tibble: 5 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [2344/586]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 2 <split [2344/586]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 3 <split [2344/586]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 4 <split [2344/586]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 5 <split [2344/586]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ``` ??? - instead of fit, we need fit_resamples because we have more than 1 split - returns a tibble with 5 rows (number of resamples) - several list columns (add pull, pluck) - `splits`: info on training and test set assignment in resample - `.metrics`: model performance - `.notes`: contains information in case an error has occurred --- ## Collapse performance results across resamples with `collect_metrics()` ```r res %>% collect_metrics() ``` ``` ## # A tibble: 2 x 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 rmse standard 56486. 5 1866. Preprocessor1_Model1 ## 2 rsq standard 0.504 5 0.0193 Preprocessor1_Model1 ``` ```r res %>% collect_metrics(summarize = FALSE) ``` ``` ## # A tibble: 10 x 5 ## id .metric .estimator .estimate .config ## <chr> <chr> <chr> <dbl> <chr> ## 1 Fold1 rmse standard 51064. Preprocessor1_Model1 ## 2 Fold1 rsq standard 0.542 Preprocessor1_Model1 ## 3 Fold2 rmse standard 57206. Preprocessor1_Model1 ## 4 Fold2 rsq standard 0.464 Preprocessor1_Model1 ## 5 Fold3 rmse standard 53526. Preprocessor1_Model1 ## 6 Fold3 rsq standard 0.557 Preprocessor1_Model1 ## 7 Fold4 rmse standard 61210. Preprocessor1_Model1 ## 8 Fold4 rsq standard 0.468 Preprocessor1_Model1 ## 9 Fold5 rmse standard 59422. Preprocessor1_Model1 ## 10 Fold5 rsq standard 0.488 Preprocessor1_Model1 ``` ??? - `collect_metrics`: helper function to expand the `.metrics` column - if summarize = TRUE (default), it averages across all folds - this code is the same as res %>% collect_metrics(summarize = FALSE): unnest(res %>% select(id, .metrics), cols = .metrics) --- ## `metric_set()` `metric_set()`: a helper function for selecting yardstick metric functions. .pull-left[ ```r fit_resamples( object, resamples, ..., * metrics = metric_set(rmse, rsq), control = control_resamples() ) ``` ] .pull-right[ .content-box-blue[ If `metrics = NULL`: - regression: `metric_set(rmse, rsq)` - classification: `metric_set(accuracy, roc_auc)` ] ] ??? 
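- a concrete usage sketch (reusing the `lm_spec` and `folds` objects from the earlier slides): request MAE on top of the defaults

```r
fit_resamples(lm_spec, Sale_Price ~ Gr_Liv_Area,
              resamples = folds,
              metrics   = metric_set(rmse, mae, rsq)) %>%
  collect_metrics()
```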
- rmse and rsq are functions --- ## Other resampling methods - `loo_cv()`: leave-one-out CV - `mc_cv()`: repeated holdout / Monte Carlo (random) CV: test sets sampled without replacement - `bootstraps()`: test sets sampled with replacement <img src="figures/_gen/07/rsample-other-resampling-1.png" width="963.779527559055" /> --- ## A classification example ```r stackoverflow <- read_rds(here::here("data/stackoverflow.rds")) glimpse(stackoverflow) ``` ``` ## Rows: 1,150 ## Columns: 21 ## $ country <fct> United States, United States, United Kingdo~ ## $ salary <dbl> 63750.00, 93000.00, 40625.00, 45000.00, 100~ ## $ years_coded_job <int> 4, 9, 8, 3, 8, 12, 20, 17, 20, 4, 3, 13, 16~ ## $ open_source <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1~ ## $ hobby <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1~ ## $ company_size_number <dbl> 20, 1000, 10000, 1, 10, 100, 20, 500, 1, 20~ ## $ remote <fct> Remote, Remote, Remote, Remote, Remote, Rem~ ## $ career_satisfaction <int> 8, 8, 5, 10, 8, 10, 9, 7, 8, 7, 9, 8, 8, 7,~ ## $ data_scientist <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ database_administrator <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0~ ## $ desktop_applications_developer <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0~ ## $ developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~ ## $ dev_ops <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0~ ## $ embedded_developer <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0~ ## $ graphic_designer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ graphics_programming <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ machine_learning_specialist <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ mobile_developer <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1~ ## $ quality_assurance_engineer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ systems_administrator <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~ ## $ web_developer <dbl> 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1~ ``` .font80[Data source: [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey)] ??? - what makes a developer more likely to work remotely? - Developers can work in their company offices or they can work remotely, and it turns out that there are specific characteristics of developers, such as the size of the company that they work for, how much experience they have, or where in the world they live, that affect how likely they are to be a remote developer. --- class: middle ## Specify a classification model .left-column[ .content-box-blue[ .font130[ 1\. Pick a **model** 2\. Set the **engine** 3\. 
Set the **mode** ] ] ] -- .right-column[ Specify a decision tree model with default parameter settings: ```r vanilla_tree_spec <- decision_tree() %>% set_engine("rpart") %>% * set_mode("classification") ``` ] --- class: middle Measure the performance of a vanilla decision tree model using 5-fold CV: ```r set.seed(100) so_cv <- vfold_cv(stackoverflow, v = 5) (fit_van_res <- fit_resamples(vanilla_tree_spec, remote ~ ., resamples = so_cv) %>% collect_metrics()) ``` ``` ## # A tibble: 2 x 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.639 5 0.00870 Preprocessor1_Model1 ## 2 roc_auc binary 0.663 5 0.0155 Preprocessor1_Model1 ``` -- 🤔 _"Can we improve the performance by tuning the algorithm parameters?"_ -- 🤔 _"Which parameters can we tune?"_ --- class: middle ## args() `args()` prints the arguments for a parsnip model specification: ```r args(decision_tree) ``` ``` ## function (mode = "unknown", cost_complexity = NULL, tree_depth = NULL, ## min_n = NULL) ## NULL ``` -- Arguments of `decision_tree()`: - `cost_complexity`: minimum fit improvement of a split (0 < `cost_complexity` `\(\leq\)` 1) - `tree_depth`: maximum number of levels in the tree - `min_n`: minimum number of observations in a node in order for a split to be attempted --- class: middle ```r decision_tree( cost_complexity = 0.01, # min. fit improvement of a split (0 < cp <=1) tree_depth = 30, # max. number of levels in the tree min_n = 20 # min. number of observations in a node in order for a split to be attempted ) ``` ``` ## Decision Tree Model Specification (unknown) ## ## Main Arguments: ## cost_complexity = 0.01 ## tree_depth = 30 ## min_n = 20 ``` -- If the arguments are left to their defaults (`NULL`), the arguments will use the engine's underlying model functions default value. For example, `rpart` is used as default engine. The default parameters are: ```r args(rpart::rpart.control) # cost_complexity -> cp; tree_depth -> maxdepth; min_n -> minsplit ``` ``` ## function (minsplit = 20L, minbucket = round(minsplit/3), cp = 0.01, ## maxcompete = 4L, maxsurrogate = 5L, usesurrogate = 2L, xval = 10L, ## surrogatestyle = 0L, maxdepth = 30L, ...) ## NULL ``` --- class: middle ## `set_args()` `set_args()`: **change** the arguments for a parsnip model specification: ```r dt_spec <- decision_tree() dt_spec %>% set_args(tree_depth = 3) ``` ``` ## Decision Tree Model Specification (unknown) ## ## Main Arguments: ## tree_depth = 3 ``` -- .pull-left[ ... which is equivalent to: ```r dt_spec <- decision_tree(tree_depth = 3) dt_spec ``` ``` ## Decision Tree Model Specification (unknown) ## ## Main Arguments: ## tree_depth = 3 ``` ] -- .pull-right[ An example spec of model, engine, mode and tree depth: ```r decision_tree() %>% set_engine("rpart") %>% set_mode("classification") %>% set_args(tree_depth = 3) ``` ``` ## Decision Tree Model Specification (classification) ## ## Main Arguments: ## tree_depth = 3 ## ## Computational engine: rpart ``` ] --- class: middle, center <img src="figures/_gen/07/cp-1-1.png" width="1020.47244094488" /> ??? 
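- one way to see the effect of `cost_complexity` directly is to compare tree sizes at two settings; an illustrative sketch (`fit_cp()` is a hypothetical helper, node counts depend on the data):

```r
fit_cp <- function(cp) {
  decision_tree(cost_complexity = cp) %>%
    set_engine("rpart") %>%
    set_mode("classification") %>%
    fit(remote ~ ., data = stackoverflow)
}
nrow(fit_cp(0.0008)$fit$frame)  # very low cp: many nodes, likely overfit
nrow(fit_cp(0.0100)$fit$frame)  # higher cp: far fewer nodes
```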
<!-- parsimonious model --> --- class: middle, center <img src="figures/_gen/07/cp-2-1.png" width="1020.47244094488" /> --- class: middle, center .pull-left[ Overfitted tree (`cost_complexity`=0.0008): <img src="figures/_gen/07/rpart-overfitted-1.png" width="425.196850393701" /> ] .pull-right[ Optimal tree (`cost_complexity`=0.0093): <img src="figures/_gen/07/rpart-ideal-size-1.png" width="425.196850393701" /> ] --- class: middle ## `workflow()` Create a workflow with `workflow()`. ??? - to perform hyperparameter tuning, we need to create a workflow object - workflow: bundle together preprocessing, modeling and postprocessing - easier to see the benefits of workflows with examples... -- ## `add_formula()` Add a formula to a workflow `workflow() %>% add_formula(Sale_Price ~ Year)` -- ## `add_model()` Add a parsnip model spec to a workflow: `workflow() %>% add_model(lm_spec)` --- ## Example workflow <!-- # tree <- fit(wf, stackoverflow) %>% pull_workflow_fit() --> ```r wf <- workflow() %>% add_formula(remote ~ .) %>% add_model(decision_tree() %>% set_engine("rpart") %>% set_mode("classification")) wf %>% fit_resamples(so_cv) ``` ``` ## # Resampling results ## # 5-fold cross-validation ## # A tibble: 5 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [920/230]> Fold1 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 2 <split [920/230]> Fold2 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 3 <split [920/230]> Fold3 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 4 <split [920/230]> Fold4 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ## 5 <split [920/230]> Fold5 <tibble[,4] [2 x 4]> <tibble[,1] [0 x 1]> ``` ??? - we do not need to specify a formula within the fitting function --- class: middle ## `update_formula()` Replace a workflow formula with a new one: ```r workflow() %>% add_formula(remote ~ .) %>% * update_formula(remote ~ salary + open_source) ``` ``` ## == Workflow ============================================================================== ## Preprocessor: Formula ## Model: None ## ## -- Preprocessor -------------------------------------------------------------------------- ## remote ~ salary + open_source ``` --- class: middle ## `update_model()` Replaces a workflow model spec with a new one: ```r workflow() %>% add_model(nearest_neighbor()) %>% update_model(decision_tree()) ``` ``` ## == Workflow ============================================================================== ## Preprocessor: None ## Model: decision_tree() ## ## -- Model --------------------------------------------------------------------------------- ## Decision Tree Model Specification (unknown) ``` --- class: center, inverse, middle .pull-left70[ # Tune model hyperparameters with tune ] .pull-right30[ <img src="figures//07-tune.png" width="100%" /> ] --- class: middle ## `tune()` `tune()` is a placeholder for hyperparameters that are to be tuned: ```r decision_tree(cost_complexity = tune()) ``` ``` ## Decision Tree Model Specification (unknown) ## ## Main Arguments: ## cost_complexity = tune() ``` --- ## `tune_grid()` A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters. ```r tune_grid( object, # a model workflow, R formula or recipe object. resamples, # a resampling object, e.g. the output of vfold_cv() ..., grid = 10, # the number of tuning iterations or a data frame of tuning operations (tuning grid) metrics = NULL, # yardstick::metric_set() or NULL control = control_grid() # An object used to modify the tuning process ) ``` ??? 
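- the grid can also be generated with dials instead of being typed by hand; a sketch using the default parameter ranges (dials is attached with tidymodels):

```r
grid_regular(cost_complexity(), tree_depth(), levels = 3)  # 3 x 3 regular grid
grid_random(cost_complexity(), tree_depth(), size = 20)    # 20 random combinations
```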
recipes will be discussed later --- class: middle ## `expand_grid()` `tidyr::expand_grid()`: takes one or more vectors, and returns a data frame holding all combinations of their values. ```r expand_grid(cost_complexity = 10^(0:-5), min_n = seq(4,20,4)) ``` ``` ## # A tibble: 30 x 2 ## cost_complexity min_n ## <dbl> <dbl> ## 1 1 4 ## 2 1 8 ## 3 1 12 ## 4 1 16 ## 5 1 20 ## 6 0.1 4 ## 7 0.1 8 ## 8 0.1 12 ## 9 0.1 16 ## 10 0.1 20 ## # ... with 20 more rows ``` .footnote[`expand_grid()` is a re-implementation of the base `expand.grid()`.] --- class: middle ```r dt_spec <- decision_tree( * cost_complexity = tune(), * tree_depth = tune() ) %>% set_engine("rpart") %>% set_mode("classification") dt_wf <- workflow() %>% add_model(dt_spec) %>% add_formula(remote ~ .) dt_res <- dt_wf %>% tune_grid(resamples = so_cv, * grid = expand_grid(cost_complexity = 10^-(1:5), tree_depth = 1:6) ) dt_res ``` ``` ## # Tuning results ## # 5-fold cross-validation ## # A tibble: 5 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [920/230]> Fold1 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]> ## 2 <split [920/230]> Fold2 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]> ## 3 <split [920/230]> Fold3 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]> ## 4 <split [920/230]> Fold4 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]> ## 5 <split [920/230]> Fold5 <tibble[,6] [60 x 6]> <tibble[,1] [0 x 1]> ``` ??? 1. specify parsnip model 1. create workflow, add parsnip model and the formula 1. invoke `tune_grid()` on the workflow and the tuning grid we create with `expand_grid()` - `dt_res`: performance for each fold stored in list column `.metrics` --- class: middle ```r dt_res %>% collect_metrics() %>% filter(.metric == "accuracy") %>% arrange(desc(mean)) ``` ``` ## # A tibble: 30 x 8 ## cost_complexity tree_depth .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~ ## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~ ## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model~ ## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model~ ## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model~ ## 6 0.001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~ ## 7 0.001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~ ## 8 0.0001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~ ## 9 0.0001 6 accuracy binary 0.646 5 0.00918 Preprocessor1_Model~ ## 10 0.00001 5 accuracy binary 0.646 5 0.00488 Preprocessor1_Model~ ## # ... 
with 20 more rows ``` --- class: middle ## `show_best()` `show_best()`: display the `n` best hyperparameters combinations according to a `metric`: ```r dt_res %>% show_best(metric = "accuracy", n = 5) ``` ``` ## # A tibble: 5 x 8 ## cost_complexity tree_depth .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model14 ## 2 0.0001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model20 ## 3 0.00001 2 accuracy binary 0.66 5 0.0158 Preprocessor1_Model26 ## 4 0.01 2 accuracy binary 0.656 5 0.0142 Preprocessor1_Model08 ## 5 0.01 3 accuracy binary 0.649 5 0.0142 Preprocessor1_Model09 ``` --- class: middle ## `autoplot()` `autoplot()`: quickly visualize tuning results ```r dt_res %>% autoplot() ``` <img src="figures/_gen/07/dt-tune-4-autoplot-1.png" width="623.622047244095" style="display: block; margin: auto;" /> --- class: middle ## `select_best()` `select_best()` returns the best combination of hyperparameters according to a metric: ```r so_best <- dt_res %>% select_best(metric = "roc_auc") so_best ``` ``` ## # A tibble: 1 x 3 ## cost_complexity tree_depth .config ## <dbl> <int> <chr> ## 1 0.001 2 Preprocessor1_Model14 ``` ??? - returns the first combination in case of ties --- class: middle ## `finalize_workflow()` `finalize_workflow()`: replaces `tune()` placeholders in a model/recipe/workflow with a set of hyper-parameter values. ```r dt_wf_final <- dt_wf %>% finalize_workflow(so_best) dt_wf_final ``` ``` ## == Workflow ============================================================================== ## Preprocessor: Formula ## Model: decision_tree() ## ## -- Preprocessor -------------------------------------------------------------------------- ## remote ~ . ## ## -- Model --------------------------------------------------------------------------------- ## Decision Tree Model Specification (classification) ## ## Main Arguments: *## cost_complexity = 0.001 *## tree_depth = 2 ## ## Computational engine: rpart ``` ??? last_fit() vignettes of a package? vignette(package="grid") grid_random(cost_complexity(), tree_depth()) --- class: inverse, middle .pull-left70[ # Preprocessing with recipes ] .pull-right30[ <img src="figures//07-recipes.png" width="100%" /> ] --- class: middle .content-box-blue[ .font130[ 1\. Create a `recipe()` 2\. Define the predictor and outcome variables 3\. Add one or more preprocessing step _specifications_ 4\. Calculate statistics from the training set 5\. Apply preprocessing to datasets ] ] --- class: middle .left-column[ .content-box-blue[ .font130[ 1\. Create a `recipe()` 2\. Define the predictor and outcome variables .fade[ 3\. Add one or more preprocessing step _specifications_ 4\. Calculate statistics from the training set 5\. Apply preprocessing to datasets ] ] ] ] .right-column[ ## recipe() `recipe()`: create a recipe by specifying predictors, responses and reference (_template_) data frame. ```r *recipe(Sale_Price ~ ., data = ames) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 73 ``` ] --- class: middle .left-column[ .content-box-blue[ .font130[ .fade[1\. Create a `recipe()`] .fade[2\. Define the predictor and outcome variables] 3\. Add one or more preprocessing step _specifications_ .fade[ 4\. Calculate statistics from the training set 5\. Apply preprocessing to datasets ] ] ] ] .right-column[ ## step_*() `step_*()`: add preprocessing step specifications in the order they will be performed. 
```r recipe(Sale_Price ~ ., data = ames) %>% # step_novel(): assign a previously unseen factor level to # a new value * step_novel(all_nominal()) %>% # step_zv(): zero variance filter: remove vars that contain # only a single value * step_zv(all_predictors()) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 73 ## ## Operations: ## ## Novel factor level assignment for all_nominal() ## Zero variance filter on all_predictors() ``` ] ??? - How does recipes know what is a predictor and what is an outcome? > formula - How does recipes know what is numeric and what is nominal? > data argument Preprocessing and Feature Engineering This part mostly concerns what we can do to our variables to make the models more effective. This is mostly related to the predictors. Operations that we might use are: transformations of individual predictors or groups of variables alternate encodings of a variable elimination of predictors (unsupervised) In statistics, this is generally called preprocessing the data. As usual, the computer science side of modeling has a much flashier name: feature engineering. Reasons for Modifying the Data Some models (K-NN, SVMs, PLS, neural networks) require that the predictor variables have the same units. Centering and scaling the predictors can be used for this purpose. Other models are very sensitive to correlations between the predictors and filters or PCA signal extraction can improve the model. As we'll see in an example, changing the scale of the predictors using a transformation can lead to a big improvement. In other cases, the data can be encoded in a way that maximizes its effect on the model. Representing the date as the day of the week can be very effective for modeling public transportation data. Many models cannot cope with missing data so imputation strategies might be necessary. Development of new features that represent something important to the outcome (e.g. compute distances to public transportation, university buildings, public schools, etc.) --- ## step_*() Complete list at: <https://recipes.tidymodels.org/reference/index.html> <iframe src="https://tidymodels.github.io/recipes/reference/index.html#section-step-functions-imputation" width="100%" height="450px"></iframe> --- ## Selectors **Selectors**, e.g., `all_nominal()` and `all_predictors()` are helper functions for selecting sets of variables, which behave similar to the select helpers from `dplyr`. ```r rec %>% step_novel(all_nominal()) %>% step_zv(all_predictors()) ``` -- .font130[ |selector |description | |:-------------------------|:-----------------------------------------------| |`all_predictors()` |Each x variable (right side of ~) | |`all_outcomes()` |Each y variable (left side of ~) | |`all_numeric()` |Each numeric variable | |`all_nominal()` |Each categorical variable (e.g. factor, string) | |`dplyr::select()` helpers |`starts_with('Lot_')`, etc. | ] --- class: middle .left-column[ .content-box-blue[ .font130[ .fade[1\. Create a `recipe()`] .fade[2\. Define the predictor and outcome variables] .fade[3\. Add one or more preprocessing step _specifications_] 4\. Calculate statistics from the training set .fade[5\. 
Apply preprocessing to datasets] ] ] ] .right-column[ ## `prep()` `prep()` "trains" a recipe, i.e., calculates statistics from the training data ```r recipe(Sale_Price ~ ., data = ames) %>% step_novel(all_nominal()) %>% step_zv(all_predictors()) %>% * prep(training = training(ames_split)) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 73 ## ## Training data contained 2198 data points and no missing data. ## ## Operations: ## ## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, Alley, ... [trained] ## Zero variance filter removed no terms [trained] ``` ] --- class: middle .left-column[ .content-box-blue[ .font130[ .fade[1\. Create a `recipe()`] .fade[2\. Define the predictor and outcome variables] .fade[3\. Add one or more preprocessing step _specifications_] .fade[4\. Calculate statistics from the training set] 5\. Apply preprocessing to datasets ] ] ] .right-column[ ## `bake()` `bake()` transforms data with the prepped recipe ```r recipe(Sale_Price ~ ., data = ames) %>% step_novel(all_nominal()) %>% step_zv(all_predictors()) %>% prep(training = training(ames_split)) %>% * bake(new_data = testing(ames_split)) # or training(ames_split) ``` ``` ## # A tibble: 732 x 74 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour ## <fct> <fct> <dbl> <int> <fct> <fct> <fct> <fct> ## 1 One_Story_PUD_~ Residentia~ 43 5005 Pave No_Al~ Slightly_~ HLS ## 2 One_Story_PUD_~ Residentia~ 39 5389 Pave No_Al~ Slightly_~ Lvl ## 3 Two_Story_1946~ Residentia~ 60 7500 Pave No_Al~ Regular Lvl ## 4 Two_Story_1946~ Residentia~ 63 8402 Pave No_Al~ Slightly_~ Lvl ## 5 Two_Story_1946~ Residentia~ 47 53504 Pave No_Al~ Moderatel~ HLS ## 6 One_Story_1946~ Residentia~ 88 11394 Pave No_Al~ Regular Lvl ## 7 One_Story_1946~ Residentia~ 0 11241 Pave No_Al~ Slightly_~ Lvl ## 8 Two_Story_PUD_~ Residentia~ 21 1680 Pave No_Al~ Regular Lvl ## 9 One_Story_1946~ Residentia~ 95 12182 Pave No_Al~ Regular Lvl ## 10 One_Story_1946~ Residentia~ 70 10171 Pave No_Al~ Slightly_~ Lvl ## # ... with 722 more rows, and 66 more variables: Utilities <fct>, Lot_Config <fct>, ## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>, ## # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>, ## # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, ## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, ## # Foundation <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, ## # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ## # Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>, ## # Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>, Gr_Liv_Area <int>, ## # Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>, ## # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>, Functional <fct>, ## # Fireplaces <int>, Garage_Type <fct>, Garage_Finish <fct>, Garage_Cars <dbl>, ## # Garage_Area <dbl>, Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <int>, ## # Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>, ## # Screen_Porch <int>, Pool_Area <int>, Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, ## # Misc_Val <int>, Mo_Sold <int>, Year_Sold <int>, Sale_Type <fct>, ## # Sale_Condition <fct>, Longitude <dbl>, Latitude <dbl>, Sale_Price <int> ``` ] ??? actually, you don't need to do this! 
The fit functions do it for you --- class: bottom, right background-image: url("figures/07-recipes-workflow.png") background-size: contain .font70[[Source](https://twitter.com/allison_horst/status/1159809527023198209?s=20)] --- ## `juice()` `juice()` returns the preprocessed training data back from a prepped recipe, without having to rerun the preprocessing steps on the training data. ```r rec <- recipe(Sale_Price ~ ., data = ames) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) rec %>% prep(training = training(ames_split), * retain = TRUE ) %>% * juice() ``` ``` ## # A tibble: 2,198 x 74 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour ## <fct> <fct> <dbl> <dbl> <fct> <fct> <fct> <fct> ## 1 One_Story_1946~ Residentia~ 2.46 2.64 Pave No_Al~ Slightly_~ Lvl ## 2 One_Story_1946~ Residentia~ 0.658 0.185 Pave No_Al~ Regular Lvl ## 3 One_Story_1946~ Residentia~ 0.687 0.507 Pave No_Al~ Slightly_~ Lvl ## 4 One_Story_1946~ Residentia~ 1.04 0.128 Pave No_Al~ Regular Lvl ## 5 Two_Story_1946~ Residentia~ 0.480 0.454 Pave No_Al~ Slightly_~ Lvl ## 6 Two_Story_1946~ Residentia~ 0.598 -0.0156 Pave No_Al~ Slightly_~ Lvl ## 7 One_Story_PUD_~ Residentia~ -0.496 -0.632 Pave No_Al~ Regular Lvl ## 8 Two_Story_1946~ Residentia~ 0.510 -0.0129 Pave No_Al~ Slightly_~ Lvl ## 9 One_Story_1946~ Residentia~ -1.71 -0.259 Pave No_Al~ Slightly_~ Lvl ## 10 One_Story_1946~ Residentia~ 0.805 0.00851 Pave No_Al~ Regular Lvl ## # ... with 2,188 more rows, and 66 more variables: Utilities <fct>, Lot_Config <fct>, ## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>, ## # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <dbl>, ## # Year_Remod_Add <dbl>, Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, ## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, ## # Foundation <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, ## # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, ## # Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>, ## # Electrical <fct>, First_Flr_SF <dbl>, Second_Flr_SF <dbl>, Gr_Liv_Area <dbl>, ## # Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <dbl>, Half_Bath <dbl>, ## # Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>, TotRms_AbvGrd <dbl>, Functional <fct>, ## # Fireplaces <dbl>, Garage_Type <fct>, Garage_Finish <fct>, Garage_Cars <dbl>, ## # Garage_Area <dbl>, Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <dbl>, ## # Open_Porch_SF <dbl>, Enclosed_Porch <dbl>, Three_season_porch <dbl>, ## # Screen_Porch <dbl>, Pool_Area <dbl>, Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, ## # Misc_Val <dbl>, Mo_Sold <dbl>, Year_Sold <dbl>, Sale_Type <fct>, ## # Sale_Condition <fct>, Longitude <dbl>, Latitude <dbl>, Sale_Price <dbl> ``` .font80[ > "As steps are estimated by `prep()`, these operations are applied to the training set. Rather than running `bake()` to duplicate this processing, this function will return variables from the processed training set." — ?recipes::juice ] ??? There are packages like embed, textrecipes, and themis that extend recipes with new steps. --- exclude: true roles You can also give variables a "role" within a recipe and then select by roles. 
```r has_role(match = "privacy") add_role(pca_rec, Fence, new_role = "privacy") update_role(rec, Fence, new_role = "privacy", old_role = "yard") remove_role(rec, Fence, old_role = "yard") ``` --- ## A full workflow ```r set.seed(123) so_cv <- vfold_cv(stackoverflow, v = 5) so_rec <- recipe(remote ~ ., data = stackoverflow) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_corr(all_predictors(), threshold = 0.5) tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") so_wf <- workflow() %>% add_model(tree_spec) %>% * add_recipe(so_rec) *fit_resamples(so_wf, # note: workflow object instead of model spec resamples = so_cv, metrics = metric_set(accuracy, sens, spec), control = control_resamples(save_pred = TRUE)) %>% # collect_metrics() %>% collect_predictions() %>% conf_mat(remote, .pred_class) ``` ``` ## Truth ## Prediction Remote Not remote ## Remote 381 224 ## Not remote 194 351 ``` --- You can tune models **and** recipes! ```r pca_tuner <- recipe(Sale_Price ~ ., data = ames) %>% step_novel(all_nominal()) %>% step_dummy(all_nominal()) %>% step_zv(all_predictors()) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) %>% * step_pca(all_predictors(), num_comp = tune()) pca_twf <- workflow() %>% add_recipe(pca_tuner) %>% * add_model(nearest_neighbor(neighbors = tune()) %>% set_engine("kknn") %>% set_mode("regression")) *tg <- expand_grid(num_comp = 2:10, neighbors = seq(1, 15, 4)) set.seed(100) cv_folds <- vfold_cv(ames, v = 5, strata = Sale_Price, breaks = 4) set.seed(100) pca_results <- pca_twf %>% tune_grid(resamples = cv_folds, grid = tg) pca_results %>% show_best(metric = "rmse") ``` ``` ## # A tibble: 5 x 8 ## neighbors num_comp .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 9 7 rmse standard 31793. 5 968. Preprocessor6_Model3 ## 2 13 7 rmse standard 31961. 5 1157. Preprocessor6_Model4 ## 3 9 8 rmse standard 31963. 5 1099. Preprocessor7_Model3 ## 4 9 5 rmse standard 32141. 5 951. Preprocessor4_Model3 ## 5 13 8 rmse standard 32180. 5 1234. 
Preprocessor7_Model4 ``` --- ## Session info .font70[ ``` ## setting value ## version R version 4.0.5 (2021-03-31) ## os Windows 10 x64 ## system x86_64, mingw32 ## ui RTerm ## language (EN) ## collate English_United States.1252 ## ctype English_United States.1252 ## tz Europe/Berlin ## date 2021-05-10 ``` ] <div style="font-size:80%;"> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AmesHousing </td> <td style="text-align:left;"> 0.0.4 </td> <td style="text-align:left;"> 2020-06-23 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> broom </td> <td style="text-align:left;"> 0.7.6 </td> <td style="text-align:left;"> 2021-04-05 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> dials </td> <td style="text-align:left;"> 0.0.9 </td> <td style="text-align:left;"> 2020-09-16 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> dplyr </td> <td style="text-align:left;"> 1.0.5 </td> <td style="text-align:left;"> 2021-03-05 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> forcats </td> <td style="text-align:left;"> 0.5.1 </td> <td style="text-align:left;"> 2021-01-27 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> ggplot2 </td> <td style="text-align:left;"> 3.3.3 </td> <td style="text-align:left;"> 2020-12-30 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> infer </td> <td style="text-align:left;"> 0.5.4 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> kableExtra </td> <td style="text-align:left;"> 1.3.4 </td> <td style="text-align:left;"> 2021-02-20 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> kknn </td> <td style="text-align:left;"> 1.3.1 </td> <td style="text-align:left;"> 2016-03-26 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> knitr </td> <td style="text-align:left;"> 1.31 </td> <td style="text-align:left;"> 2021-01-27 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> modeldata </td> <td style="text-align:left;"> 0.1.0 </td> <td style="text-align:left;"> 2020-10-22 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> parsnip </td> <td style="text-align:left;"> 0.1.5 </td> <td style="text-align:left;"> 2021-01-19 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> patchwork </td> <td style="text-align:left;"> 1.1.1 </td> <td style="text-align:left;"> 2020-12-17 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> purrr </td> <td style="text-align:left;"> 0.3.4 </td> <td style="text-align:left;"> 2020-04-17 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2020-10-05 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> </tbody> </table> ] .pull-right[ <table> <thead> <tr> <th 
style="text-align:left;"> package </th> <th style="text-align:left;"> version </th> <th style="text-align:left;"> date </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> recipes </td> <td style="text-align:left;"> 0.1.15 </td> <td style="text-align:left;"> 2020-11-11 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> rlang </td> <td style="text-align:left;"> 0.4.11 </td> <td style="text-align:left;"> 2021-04-30 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> rpart </td> <td style="text-align:left;"> 4.1.15 </td> <td style="text-align:left;"> 2019-04-12 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> rpart.plot </td> <td style="text-align:left;"> 3.0.9 </td> <td style="text-align:left;"> 2020-09-17 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> rsample </td> <td style="text-align:left;"> 0.0.9 </td> <td style="text-align:left;"> 2021-02-17 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> scales </td> <td style="text-align:left;"> 1.1.1 </td> <td style="text-align:left;"> 2020-05-11 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> stringr </td> <td style="text-align:left;"> 1.4.0 </td> <td style="text-align:left;"> 2019-02-10 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tibble </td> <td style="text-align:left;"> 3.1.1 </td> <td style="text-align:left;"> 2021-04-18 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> tidymodels </td> <td style="text-align:left;"> 0.1.2 </td> <td style="text-align:left;"> 2020-11-22 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> tidyr </td> <td style="text-align:left;"> 1.1.3 </td> <td style="text-align:left;"> 2021-03-03 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> tidyverse </td> <td style="text-align:left;"> 1.3.0 </td> <td style="text-align:left;"> 2019-11-21 </td> <td style="text-align:left;"> CRAN (R 4.0.2) </td> </tr> <tr> <td style="text-align:left;"> tune </td> <td style="text-align:left;"> 0.1.2 </td> <td style="text-align:left;"> 2020-11-17 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> <tr> <td style="text-align:left;"> vctrs </td> <td style="text-align:left;"> 0.3.8 </td> <td style="text-align:left;"> 2021-04-29 </td> <td style="text-align:left;"> CRAN (R 4.0.5) </td> </tr> <tr> <td style="text-align:left;"> workflows </td> <td style="text-align:left;"> 0.2.2 </td> <td style="text-align:left;"> 2021-03-10 </td> <td style="text-align:left;"> CRAN (R 4.0.4) </td> </tr> <tr> <td style="text-align:left;"> yardstick </td> <td style="text-align:left;"> 0.0.7 </td> <td style="text-align:left;"> 2020-07-13 </td> <td style="text-align:left;"> CRAN (R 4.0.3) </td> </tr> </tbody> </table> ] </div> --- class: last-slide, center, bottom # Thank you! Questions? .courtesy[📷 Photo courtesy of Stefan Berger] ??? 
- tidymodels is only at version 0.1 on CRAN, i.e., still in early development with no major release yet
- be aware that code from today may not work with a future version
- still, it is very likely that tidymodels will soon cover at least caret's functionality, integrate even better with the tidyverse, and keep getting better in the near future
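- for reference, a minimal end-to-end sketch that strings the pieces of this tutorial together (object names are illustrative):

```r
set.seed(123)
ames_split <- initial_split(ames, prop = 3/4, strata = Sale_Price, breaks = 4)
folds      <- vfold_cv(training(ames_split), v = 5)

rec <- recipe(Sale_Price ~ ., data = training(ames_split)) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

wf <- workflow() %>% add_recipe(rec) %>% add_model(knn_spec)

tuned    <- tune_grid(wf, resamples = folds, grid = tibble(neighbors = c(5, 9, 13)))
final_wf <- finalize_workflow(wf, select_best(tuned, metric = "rmse"))

# last_fit(): refit on the full training set, evaluate once on the held-out test set
last_fit(final_wf, ames_split) %>% collect_metrics()
```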