Course Description

Students will gain experience with the data science process, including:

  • Data collection
  • Data wrangling
  • Programming (Python/R)
  • Data exploration
  • Data visualization
  • Modelling
  • Communication
  • Reproducibility

Students will learn about these topics by working on case studies based on problems that data scientists face in industry and academic research. Many of the cases will involve data collected by an organization or scientist, published data, or data scraped from web pages. All projects will involve some type of collaboration or communication. Students are expected to be familiar with the application of basic statistical methods used for inference (e.g., general linear models) and prediction (e.g., linear and logistic regression), and to be comfortable with basic data analysis using a programming language such as R or Python. Students will be expected to adopt a reproducible research workflow using tools such as GitHub and R Markdown or Jupyter.

Class time will be a mixture of informal lectures, class discussions, and student presentations.

Evaluation

All work will be graded on a scale from 1 to 4 (sometimes with pluses and minuses) where:

Grade value    Description
1              Work does not meet expectations.
2              Work meets expectations minimally, possibly missing some.
3              Good work; meets all or most expectations.
4              Excellent work; exceeds expectations.

Grades will almost always be 2 or 3 (1’s and 4’s are rare). Generally speaking, a 2 is a B, a 3 is an A, and a 4 is an A+.

Item                      Description                                                  Value
In-class Labs
  Lab #1                                                                               5.00%
  Lab #2                                                                               5.00%
  Lab #3                                                                               5.00%
Projects
  Project #1                                                                           25.00%
  Project #2                                                                           25.00%
  Project #3                                                                           25.00%
Reflection on projects    Written reflection                                           5.00%
Participation             Attendance, active in discussions, and prepared for class    5.00%
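
As a rough illustration of how the weights above fit together, the following Python sketch checks that the listed values sum to 100% and computes a weighted average on the 1 to 4 scale. The syllabus does not state how individual grades are combined into a final mark, so the weighted-average rule, the function name weighted_score, and the example scores are all assumptions made for illustration only.

```python
# Weights copied from the evaluation table (percent of the final grade).
weights = {
    "Lab #1": 5.0,
    "Lab #2": 5.0,
    "Lab #3": 5.0,
    "Project #1": 25.0,
    "Project #2": 25.0,
    "Project #3": 25.0,
    "Reflection on projects": 5.0,
    "Participation": 5.0,
}

# Sanity check: the listed components should account for the whole grade.
total = sum(weights.values())


def weighted_score(scores, weights):
    """Weighted mean of 1-4 scores.

    ASSUMPTION: a simple weighted average is used here for illustration;
    the syllabus does not specify the actual combination rule.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[k] * scores[k] for k in weights) / sum(weights.values())


# Hypothetical student: a 3 on everything except a 4 on Project #2.
example_scores = {name: 3.0 for name in weights}
example_scores["Project #2"] = 4.0

if __name__ == "__main__":
    print(f"Total weight: {total}%")
    print(f"Weighted score: {weighted_score(example_scores, weights)}")
```

Under this assumed rule, the three projects dominate the result, since together they carry 75% of the weight.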

Course Schedule

This is a half-credit course that meets in both the fall and winter terms. Class meetings will occur approximately every two weeks.

Class Date Description Reading Due
1 10-Sep Introduction to course, hypothesis testing in data analysis ISLR: Chapt. 2, 3.1. CASI: Chapt 1, 2
17-Sep No class
2 24-Sep Multiple linear regression ISLR: 3.2-3.6. In-class lab #1
01-Oct No class In-class lab #2
3 08-Oct Logistic regression and other linear models (GLM) ISLR: Chapt 4; CASI: Chapt 8
15-Oct No class In-class lab #3
4 22-Oct No class
29-Oct Introduction of Project #1 TBD
05-Nov Fall Reading Week - No class
5 12-Nov Data visualization
19-Nov No class
6 26-Nov Office hours
03-Dec Student presentations for project #1 Project #1 presentations
10-Dec Winter Break
17-Dec Winter Break
24-Dec Winter Break
31-Dec Winter Break
07-Jan No class
7 14-Jan Introduction to Diabetes Complications Project #2 Description will be posted after class
21-Jan Introduction to RCT Fraud Project #2 Read descriptions posted on Quercus
8 28-Jan No class
04-Feb Office hours
9 11-Feb Student presentations for project #2 Project #2 presentations
18-Feb Reading Week
10 25-Feb Introduction of Project #3 TBD
03-Mar No class
11 10-Mar Guest lecture: Dr. Inmar Givoni, Uber ATG. Deep Learning for self driving
17-Mar No class
12 24-Mar Student presentations for project #3 Project #3 presentations
31-Mar No class
07-Apr No class

Required Readings

ISLR (James et al. 2013) and CASI (Efron and Hastie 2016) are both freely available.

Other readings may be assigned during the course.

Course Books

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference. Vol. 5. Cambridge University Press.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.