class: title-slide, center, bottom

# 00 - Course Introduction

## Data Science with R · Summer 2021

### Uli Niemann · Knowledge Management & Discovery Lab

#### [https://brain.cs.uni-magdeburg.de/kmd/DataSciR/](https://brain.cs.uni-magdeburg.de/kmd/DataSciR/)

.courtesy[📷 Photo courtesy of Ulrich Arendt]

---
class: middle

.left-column[
## Welcome to<br />Data Science with R!
]

.right-column[
<img src="figures//00-collage.png" width="100%" />
]

???

- course organization + first tour of R and RStudio

---
class: bottom
background-image: url("figures/00-stay-at-home.jpg")
background-size: fill

.font80[Photo by [Amelie & Niklas Ohlrogge](https://unsplash.com/photos/P_2za841j_w)]

---
exclude: true
class: middle

## Who are <u>YOU</u>?

.pull-left60[

- You are studying either **DKE** or another Master FIN degree with a focus on data science.
- You have taken one or more **introductory data science courses** in previous semesters.
  - "Data Mining I" by Prof. Spiliopoulou
  - "Machine Learning" by Prof. Nürnberger
  - "Visual Analytics" by Prof. Preim
  - ...
- While these courses taught you fundamental data science concepts, you have not had the chance to apply this theoretical knowledge in a **practical project on real-world datasets**.
- You are most proficient in **Python** (or **Java**).
- You have no or very **little experience with R**, but are keen to learn it, either because you have heard of R's reputation as a powerful language for data science projects or simply to complete your programming skill set.

]

.pull-right40[
<img src="figures//00-student.jpg" width="100%" />

.font80[Photo by [Wes Hicks](https://unsplash.com/photos/4-EeTnaC1S4)]
]

---
class: middle

## Data Science with R

<img src="figures//00-datascir-workflow.png" width="100%" />

???

- at the end of the semester, you should be able to confidently use R to gain insights from data
- workflow of a typical DS project
  - import data from a file, database, website or web API and load it as a data frame into R
  - transform and clean
    - bring the data into a form that is consistent with its semantics
    - transform: feature engineering: derive new variables or do some feature selection or reduction (required most of the time)
  - visualization: shows things that you did not expect, or raises new questions about the data
    - hints that you are asking the wrong question or need to collect different data
  - models: classification, regression
  - communication: crucial to communicate results to others
- DS often involves a lot of trial and error
  - you don't know the best approach in advance
  - often starts with some initial questions
  - initial questions are often quite vague, because you are working on an unsolved problem
  - improve quality of the questions after EDA (better data understanding) or first modeling results

https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/

---
class: middle

## Course structure

.pull-left[
.content-box-blue[
.font150[**Part 1: R "crash course"**]

<img src="figures//00-crash-course.jpg" width="100%" />

.font80[Photo by [NeONBRAND](https://unsplash.com/photos/1-aA2Fadydc)]
]
]

.pull-right[
.content-box-green[
.font150[**Part 2: Data science project**]

<img src="figures//00-students.jpg" width="100%" />

.font80[Photo by [Annie Spratt](https://unsplash.com/photos/QckxruozjRg)]
]
]

???

Part 1: until first week of June

- asynchronous content: slides, videos, exercises, and quizzes
- synchronous elements: weekly meetings (code-along, questions)

Part 2: throughout the entire semester

---

## Website & syllabus

<iframe src="https://brain.cs.uni-magdeburg.de/kmd/DataSciR#syllabus" width="100%" height="550px"></iframe>

???

first part: in each week we cover a different topic

- videos
- slides
- exercises
- quizzes
- additional reading
- crash course is not meant to be exhaustive
- "common ground" like packages that all of you are most likely going to use in your projects
- however, you are encouraged to also use other packages

After successful completion of this course, you will be able to proficiently perform the following tasks in `R`:

- **import and preprocess raw data**
- **transform data** for modelling
- perform **exploratory data analysis** with **summary statistics** and **visualization**
- understand, build and evaluate predictive **classification and prediction models**, including **regression** models, **tree-based** models, **ensembles** and **boosted** models
- communicate and disseminate results and findings through **reproducible documents**, presentations, websites and **interactive web applications**

---
class: middle

## Resources

.pull-left70[
.font110[

.content-box-blue[
- Website: <https://brain.cs.uni-magdeburg.de/kmd/DataSciR/>
]

.content-box-yellow[
- GitHub: <https://github.com/unmnn/datascir21>
]

.content-box-red[
- Moodle: <https://elearning.ovgu.de/course/view.php?id=9799>
]

.content-box-green[
- Zoom: <https://ovgu.zoom.us/j/92337441611>
]

]
]

.pull-right30[
<img src="figures//00-pile.jpg" width="100%" />

.font80[Photo by [Sharon McCutcheon](https://unsplash.com/photos/tn57JI3CewI)]
]

???

- Website: hub for all materials and news about the course
- GitHub: course materials are hosted there (slides, exercises + data)
- Moodle: video recordings, quizzes, forum
- Zoom: during the first part of the course
  - weekly meetings with code-alongs + Q&A

---

## Quizzes

- Handled on E-Learning (Moodle)
- 7 quizzes in total with 5 questions each
- Each quiz opens on Monday at 00:00 and is due on the following Sunday at 23:59 German time
  - The first quiz is available between 12.04 00:00 and 18.04 23:59
- Work on them individually
- Lowest score is dropped, i.e., only your 6 best results count
- Total score of correctly answered questions gives you up to 30 pt for your final grade
- Questions must be answered in a fixed order, which means you **cannot** return to previous questions

???

- individual part:
  - purpose: motivate you to engage with the crash course part of the course

---
class: middle
background-image: url("figures/00-example-questions.png")
background-size: contain

---

## Project

In a nutshell: **_Find an interesting dataset and do something with it!_**

- analyze a dataset of your choice
- demonstrate your proficiency in the techniques we have covered in this course (+ desirably beyond)
- **quality over quantity**: don't compute every possible statistic, but:
  - show that you can **ask meaningful questions and answer them using R**
  - display that you are **skilled at interpreting and presenting your results**
- critically scrutinize your approach (**adequacy**) and your results (**reliability, validity**) and make suggestions for improving your analysis
- emphasize **compelling visualizations** and a **convincing, coherent and clear narrative**
- all analyses must be done in R; you are encouraged to use tidyverse packages but you don't have to

???

- team work part
- Moodle forum to form teams

---

## The project data

- choose any dataset which is _not analyzed to death_ (no Kaggle, UCI ML repo, or similar)
- the dataset may already exist or you may collect your own data via a survey or an experiment
- the dataset may be based on your personal interests or on previous projects in other courses

---

<iframe src="https://brain.cs.uni-magdeburg.de/kmd/DataSciR/project.html" width="100%" height="550px"></iframe>

???

- team size: 3-4 ideal
- show example projects
- description
- project team
- milestones
- team formation
- topic
- R Markdown process notebook
- Code
- project website
- screencast
  - Please ensure sufficient sound quality.
  - It may be worthwhile to borrow or invest in an external USB microphone.
  - What do you feel is the best part of your project?
  - What insights did you gain?
  - What is the single most important thing you would like your audience to take away?
  - Make sure it is front and center rather than at the end.
- final presentation
- submission instructions
- hall of fame

---

## Deliverables & Hard Deadlines

.pull-left60[
.font130[

Date | Description
:------------ | :---------------------------------------------------------
20.05. ~~16.05.~~ | Team formation & project proposal submission; registration
24.-28.05. | Project proposal feedback
06.07. | Final project submission due
09.07. | Final presentations, exact time and location to be announced

]
]

.pull-right40[
<img src="figures//00-time.jpg" width="100%" />

.font80[Photo by [Kevin Ku](https://unsplash.com/photos/aiyBwbrWWlo)]
]

---

## Your final score in the course

Your overall course grade will be determined by the following components:

1\. **30 pt**: Weekly quizzes

2\.
**70 pt**: Project:

- 8 pt: Project proposal
- 25 pt: Quality of R Markdown notebook with respect to correctness, comprehensibility and reproducibility
- 10 pt: Complexity and level of difficulty of the project
- 5 pt: Completeness and overall functionality of the repository and website
- 8 pt: Screencast
- 14 pt: Final presentation

---

## Zoom meetings

We will have weekly Zoom meetings. They are completely optional.

The Zoom meetings are mainly for:

* code-alongs
* Q&A

--

Expectations for Zoom meetings:

In larger sessions you should

- have your microphone muted by default
- use the _raise hand_ feature or type questions and comments in the chat

In small team sessions you should

- have your camera turned on as much as possible
- engage with your team mates via voice and text chat
- take turns sharing your screens when necessary

---

## Collaboration regulations

- Only work in the context of the team project should be completed collaboratively.
- Exercises should be completed individually.
- Quizzes must be completed individually.
- You are welcome to discuss problems in general and ask for advice.

---

## Sharing and reusing code policy

- You can use any online resources, but you must explicitly state where you got the code that you use directly or use as inspiration for your solutions.
- Any reused code that is not explicitly cited will be treated as plagiarism, regardless of the source.

---

## Late work policy

- Submissions after the deadline will not be considered.
- General advice: Don't wait until the last minute.
- But: Email me if there are reasons beyond your control that prevent timely submission.

---
class: last-slide, center, bottom

# Thank you! Questions?

.courtesy[📷 Photo courtesy of Stefan Berger]