Professor Nathan Taback
office: SS6027C
e-mail: nathan.taback@utoronto.ca
office hours: TBD
Course website: https://sta2453.github.io
The primary learning objectives are:
But, not everyone ...
These quotes are from Tukey's 1962 paper, The Future of Data Analysis.
... data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.
Data analysis ... take on the characteristics of a science rather than those of mathematics ...
Roger Peng states that
If one were to write down the steps in a data analysis, you might come up with something along these lines of the following list
- Defining the question
- Defining the ideal dataset
- Determining what data you can access
- Obtaining the data
- Cleaning the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpretation of results
- Challenging of results
- Synthesis and write up
- Creating reproducible code
I use R and Python. RStudio and Jupyter notebooks are excellent tools.
I expect that all work will be done in a "reproducible" manner.
# Load the tidyverse once, without the startup messages
suppressPackageStartupMessages(library(tidyverse))
library(repr)
# Set the default plot size in the notebook
options(repr.plot.width=4, repr.plot.height=3)
agedat <- tibble(age = c(25, 20, 21, 120, 19, 31, 90, 17), sex = c(rep("Male", 4), rep("Female", 4)))
ggplot(agedat, aes(sex, age)) + geom_boxplot()
agedat_clean <- filter(agedat, age <= 100) # remove age > 100
ggplot(agedat_clean, aes(sex, age)) + geom_boxplot()
Go to Python notebook for Python version
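The Python notebook itself is not reproduced here. As a rough stand-in, the sketch below repeats the same cleaning step on the same toy data using only the Python standard library. It prints group medians instead of drawing boxplots, and the helper `median_age_by_sex` is a hypothetical name introduced for illustration, not something from the course materials.

```python
from statistics import median

# Same toy data as the R example: four males followed by four females.
ages = [25, 20, 21, 120, 19, 31, 90, 17]
sexes = ["Male"] * 4 + ["Female"] * 4
agedat = [{"age": a, "sex": s} for a, s in zip(ages, sexes)]

# Keep only rows with age <= 100 (this drops the implausible value 120).
agedat_clean = [row for row in agedat if row["age"] <= 100]


def median_age_by_sex(rows):
    """Return {sex: median age} for a list of row dictionaries (hypothetical helper)."""
    groups = {}
    for row in rows:
        groups.setdefault(row["sex"], []).append(row["age"])
    return {sex: median(values) for sex, values in groups.items()}


before = median_age_by_sex(agedat)
after = median_age_by_sex(agedat_clean)
print("Medians before cleaning:", before)
print("Medians after cleaning: ", after)
```

Because every step is in code, anyone rerunning the script sees exactly which observation was removed and why, which is the point of working reproducibly.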
We will cover basic statistical (and some machine learning) methods at an introductory level.
But, I expect that you are open to learning new statistical methods with minimal guidance.
(Ref: CASI, pg. 8 - 10)
72 leukemia patients, 47 with ALL (acute lymphoblastic leukemia) and 25 with AML (acute myeloid leukemia, a worse prognosis) have each had genetic activity measured for a panel of 7,128 genes.
The data are shown below
leuk_dat <- read_csv("https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia_big.csv")
leuk_dat %>% head()
The histograms and boxplots below show the distributions for gene 48 in the ALL and AML groups.
Create the histogram and boxplots above using R or Python. Interpret each plot. Which plot do you prefer to compare the distributions? Why?
Is there a difference in gene 48 between patients with ALL (acute lymphoblastic leukemia) and patients with AML (acute myeloid leukemia, a worse prognosis)?
Use R or Python to calculate a p-value for a test comparing the two groups. Interpret the p-value.
This is one candidate gene out of 7,128 genes. Is $t=-3.30$ unusual?
Create the histogram above using R or Python. Write a few sentences describing how the histogram above was generated. Is gene48 unusual compared to other genes?
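One way to think about whether a single statistic is unusual among 7,128 candidates is to ask how many genes would be expected to look at least this extreme by chance alone. The sketch below is a back-of-the-envelope calculation, not the method used in the reference: it assumes that every gene is null and that each t-statistic is approximately standard normal, which is only a rough approximation to a t distribution.

```python
from statistics import NormalDist

n_genes = 7128   # number of genes in the panel (from the text)
t_obs = -3.30    # observed statistic for gene 48 (from the text)

# Two-sided tail probability under a standard normal approximation (an assumption).
two_sided_tail = 2 * NormalDist().cdf(-abs(t_obs))

# Expected number of genes at least this extreme if no gene differed between groups.
expected_extreme = n_genes * two_sided_tail

print(f"Approximate two-sided tail probability: {two_sided_tail:.5f}")
print(f"Expected number of null genes with |t| >= {abs(t_obs)}: {expected_extreme:.1f}")
```

If several genes are expected to reach this value by chance, a statistic that looks striking in isolation is less remarkable in the context of the whole panel.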
During a period of two years, which hospital is more likely to have more than 60% boys born? Use R or Python to create similar histograms that show the distributions of boys in a large and small hospital.
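A small simulation can make the hospital comparison concrete. The sketch below is one possible setup, not the course solution: the hospital sizes (45 and 15 births per day), the 50% chance of a boy, and the function name `days_over_60_percent_boys` are all assumptions chosen for illustration. It counts, for each hospital, the number of days in two years on which more than 60% of births were boys.

```python
import random


def days_over_60_percent_boys(births_per_day, n_days, rng):
    """Count simulated days on which more than 60% of births are boys."""
    count = 0
    for _ in range(n_days):
        # Each birth is a boy with probability 0.5 (an assumption).
        boys = sum(rng.random() < 0.5 for _ in range(births_per_day))
        # Integer comparison avoids floating-point issues at exactly 60%.
        if 10 * boys > 6 * births_per_day:
            count += 1
    return count


rng = random.Random(2453)  # fixed seed so the run is reproducible
n_days = 2 * 365           # two years of daily records

large_days = days_over_60_percent_boys(45, n_days, rng)  # assumed large hospital
small_days = days_over_60_percent_boys(15, n_days, rng)  # assumed small hospital

print("Large hospital, days with > 60% boys:", large_days)
print("Small hospital, days with > 60% boys:", small_days)
```

Storing the daily proportions instead of only the counts would give the data needed for the histograms the exercise asks for.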
set.seed(100)
gene_48 %>% select(value, group) %>% sample_n(5)
gene_48 %>% group_by(group) %>%
summarise(n = n(), mean = mean(value), sd = sd(value)) %>%
mutate(diff = mean - lag(mean))
Under the assumption of no difference, the labels AML and ALL are interchangeable. The number of permutations of the labels is the number of ways that the observed data could have been generated under that assumption.
options(repr.plot.width=6, repr.plot.height=3)
tibble(rs_dist) %>%
ggplot(aes(rs_dist)) + geom_histogram(colour = "black", fill = "grey", bins = 25) +
geom_vline(xintercept = -0.2269164, colour = "red") +
ggtitle("Permutation Null Distribution of Mean Difference") + xlab("Mean Difference") +
geom_text(aes(-0.2269164, 0, label = "-0.23", hjust = 1, vjust = 1.2), size = 3)
tibble(rs_dist) %>% summarise(pvalue = sum(rs_dist <= -0.2269164)/Numsim)
Use R or Python to create the permutation distribution. Calculate the two-sided P-value. Briefly interpret the two-sided P-value.
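The general shape of a permutation test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a worked solution: the two small samples are made-up numbers, not the leukemia data, and `permutation_test` is a hypothetical helper. It samples random relabelings instead of enumerating all of them, since the number of distinct ways to assign 25 AML labels among 72 patients is far too large to list.

```python
import math
import random
from statistics import mean


def permutation_test(x, y, n_sim, rng):
    """Return the observed mean difference and a two-sided permutation p-value."""
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    null_diffs = []
    for _ in range(n_sim):
        # Shuffling the pooled values is equivalent to permuting the group labels.
        rng.shuffle(pooled)
        null_diffs.append(mean(pooled[:n_x]) - mean(pooled[n_x:]))
    # Two-sided: count shuffles at least as extreme in absolute value
    # (small tolerance guards against floating-point rounding).
    extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in null_diffs)
    return observed, extreme / n_sim


# Hypothetical toy values, NOT the gene 48 measurements.
aml_values = [0.9, 1.1, 0.7, 1.3, 1.0]
all_values = [0.6, 0.8, 0.5, 0.9, 0.7]

observed, p_two_sided = permutation_test(aml_values, all_values, 2000, random.Random(100))
print(f"Observed mean difference: {observed:.2f}")
print(f"Two-sided permutation p-value: {p_two_sided:.3f}")

# Number of distinct labelings with 25 AML among 72 patients (sizes from the text).
n_labelings = math.comb(72, 25)
print("Distinct labelings of 72 patients:", n_labelings)
```

The two-sided P-value is the proportion of shuffled mean differences whose absolute value is at least as large as the observed one, so it treats large differences in either direction as evidence against the assumption of no difference.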