STA2453 - Data Science Methods, Collaboration, and Communication¶

Class #1¶

September 10, 2019¶

Professor Nathan Taback

office: SS6027C

e-mail: nathan.taback@utoronto.ca

office hours: TBD

Course website: https://sta2453.github.io

Course Overview¶

The primary learning objectives are:

to gain experience using data analysis to extract information.
to gain experience communicating information that arises from a data analysis.

Computational Notebooks¶

Lots of users love them ...

Ref: https://github.com/parente/nbestimate

But, not everyone ...

What is data analysis?¶

These quotes are from Tukey's 1962 paper, The Future of Data Analysis.

... data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Data analysis ... take on the characteristics of a science rather than those of mathematics ...

Roger Peng states that

If one were to write down the steps in a data analysis, you might come up with something along these lines of the following list

Defining the question

Defining the ideal dataset

Determining what data you can access

Obtaining the data

Cleaning the data

Exploratory data analysis

Statistical prediction/modeling

Interpretation of results

Challenging of results

Synthesis and write up

Creating reproducible code

Computing¶

I use R and Python. RStudio and Jupyter notebooks are excellent tools.
I expect that all work will be done in a "reproducible" manner.

Data acquisition¶

Experiment
Observational study

Data cleaning and wrangling¶

Data cleaning example

suppressPackageStartupMessages(library(tidyverse)) # load the tidyverse once, quietly
library(repr)
options(repr.plot.width=4, repr.plot.height=3) # plot size in the notebook

# Toy data containing one implausible age (120)
agedat <- tibble(age = c(25, 20, 21, 120, 19, 31, 90, 17), sex = c(rep("Male", 4), rep("Female", 4)))
ggplot(agedat, aes(sex, age)) + geom_boxplot() # before cleaning
agedat_clean <- filter(agedat, age <= 100) # keep age <= 100, i.e. remove age > 100
ggplot(agedat_clean, aes(sex, age)) + geom_boxplot() # after cleaning

Go to Python notebook for Python version
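
A minimal Python sketch of the same cleaning step, using only the standard library. It is not the course's Python notebook, whose contents are not reproduced here. The helper `summarize` is a hypothetical stand-in that reports the numbers a boxplot would display instead of drawing one.

```python
from statistics import median

# Same toy data as the R example: 4 rows labelled Male, then 4 labelled Female.
ages = [25, 20, 21, 120, 19, 31, 90, 17]
sexes = ["Male"] * 4 + ["Female"] * 4
agedat = [{"age": a, "sex": s} for a, s in zip(ages, sexes)]

# Keep only rows with age <= 100 (this drops the implausible value 120).
agedat_clean = [row for row in agedat if row["age"] <= 100]


def summarize(rows):
    """Return {sex: (n, min, median, max)} as a text stand-in for a boxplot."""
    out = {}
    for sex in sorted({r["sex"] for r in rows}):
        vals = [r["age"] for r in rows if r["sex"] == sex]
        out[sex] = (len(vals), min(vals), median(vals), max(vals))
    return out


before = summarize(agedat)
after = summarize(agedat_clean)
print("before:", before)
print("after: ", after)
```

Comparing `before` and `after` shows that only the Male group changes, because the single removed row (age 120) was labelled Male.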

Data wrangling (transformation, etc.)¶

Methods¶

We will cover basic statistical (and some machine learning) methods at an introductory level.
But I expect that you are open to learning new statistical methods with minimal guidance.

Data analysis - Case study¶

Genetic Activity for Leukemia Patients¶

(Ref: CASI, pg. 8 - 10)

72 leukemia patients, 47 with ALL (acute lymphoblastic leukemia) and 25 with AML (acute myeloid leukemia, a worse prognosis), have each had genetic activity measured for a panel of 7,128 genes.

The data are shown below

leuk_dat <- read_csv("https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia_big.csv")
leuk_dat %>% head()

Warning message:
“Duplicated column names deduplicated: 'ALL' => 'ALL_1' [2], …, 'ALL' => 'ALL_19' [20], 'AML' => 'AML_1' [22], …, 'AML' => 'AML_13' [34], 'ALL' => 'ALL_20' [35], …, 'ALL' => 'ALL_46' [61], 'AML' => 'AML_14' [62], …, 'AML' => 'AML_24' [72]”
Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.

The histograms and boxplots below show the distributions for gene 48 in the ALL and AML groups.

Exercise #1¶

Create the histograms and boxplots above using R or Python. Interpret each plot. Which plot do you prefer for comparing the distributions? Why?

Is there a difference in gene 48 between patients with ALL (acute lymphoblastic leukemia) and patients with AML (acute myeloid leukemia, a worse prognosis)?

Exercise #2¶

Use R or Python to calculate a p-value for a test comparing the two groups. Interpret the p-value.

This is one candidate gene out of 7,128 genes. Is $t=-3.30$ unusual?
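
The value $t=-3.30$ can be checked against the group summaries reported in these notes (n, mean and sd of gene 48 for ALL and AML). The sketch below assumes the pooled-variance two-sample t-test, which is consistent with the output labelled "Two Sample t-test" with 70 degrees of freedom. The function name `pooled_t` is hypothetical, and the p-value is left out because the standard library has no t distribution.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Gene 48 summaries as reported in these notes (ALL first, then AML).
t_stat, df = pooled_t(47, 0.3038753, 0.2574523, 25, 0.5307917, 0.3120515)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```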

Exercise #3¶

Create the histogram above using R or Python. Write a few sentences describing how the histogram above was generated. Is gene 48 unusual compared to other genes?

Statistical Thinking and Hypothesis Testing¶

The general ideas behind testing hypotheses permeate "statistical thinking".
The core of statistical thinking includes the appreciation of uncertainty and variability.
Hospital A has 20 births per day and Hospital B has 50 births per day. During a period of one year, which hospital is more likely to have more than 60% boys born?
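
One way to make the hospital question concrete is sketched below, under assumptions that are not stated in the notes: each birth is independently a boy with probability 0.5, and the question is read as "on how many days of the year are more than 60% of that day's births boys?". The function name `prob_more_than_60pct_boys` is hypothetical.

```python
from math import comb


def prob_more_than_60pct_boys(n_births, p_boy=0.5):
    """Exact binomial P(more than 60% of n_births are boys) for a single day."""
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy) ** (n_births - k)
        for k in range(n_births + 1)
        if 10 * k > 6 * n_births  # integer form of k / n_births > 0.6
    )


p_small = prob_more_than_60pct_boys(20)  # Hospital A, 20 births per day
p_large = prob_more_than_60pct_boys(50)  # Hospital B, 50 births per day
print(f"Hospital A: {p_small:.3f} per day, about {365 * p_small:.0f} days per year")
print(f"Hospital B: {p_large:.3f} per day, about {365 * p_large:.0f} days per year")
```

Under these assumptions the smaller hospital crosses the 60% threshold more often, because a proportion based on fewer births varies more from day to day.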

Exercise #4¶

During a period of two years, which hospital is more likely to have more than 60% boys born? Use R or Python to create similar histograms that show the distributions of boys in a large and small hospital.

Hypothesis Testing¶

Let's look at the expression of gene 48 between the two groups, ALL and AML.
Assume that there is no difference in gene expression between the two groups. This means that the same gene expression value would occur for a patient whether that patient is in the ALL group or the AML group.

set.seed(100)
gene_48 %>% select(value, group) %>% sample_n(5)

gene_48 %>% group_by(group) %>% 
  summarise(n = n(), mean = mean(value), sd = sd(value))  %>%
  mutate(diff = mean - lag(mean))

Under the assumption of no difference, how likely is it that we would observe a mean difference of -0.23 (the observed difference, ALL minus AML, which corresponds to $t=-3.30$)?
Under the assumption of no difference, the group labels AML and ALL are interchangeable. The number of distinct ways to assign the labels is the number of ways that the observed data could have been generated under the assumption of no difference.
In this case there are ${{47+25} \choose 47} = 1.5 \times 10^{19}$ such assignments.
Instead of evaluating all $1.5 \times 10^{19}$ assignments, we can create a resampled permutation of the data set by permuting the group labels, and then repeat this, say, 10000 times (see pages 49-51, CASI).
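
The resampling idea can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the code that produced the results in these notes: the function `permutation_test` is hypothetical, and the data are simulated stand-ins (normal draws using the reported group sizes, means and standard deviations), not the leukemia measurements. The one-sided p-value uses the lower tail, which suits a negative observed difference.

```python
import math
import random


def permutation_test(values, labels, n_perm=10000, seed=100):
    """Resampled permutation test for a difference in group means.

    Returns (observed_diff, one_sided_p, two_sided_p). The difference is
    mean(first label) - mean(second label), with labels in sorted order.
    """
    first, second = sorted(set(labels))

    def mean_diff(labs):
        a = [v for v, g in zip(values, labs) if g == first]
        b = [v for v, g in zip(values, labs) if g == second]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = mean_diff(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    lower = extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the group labels, keep the values fixed
        d = mean_diff(shuffled)
        lower += d <= observed              # lower-tail count
        extreme += abs(d) >= abs(observed)  # two-sided count
    return observed, lower / n_perm, extreme / n_perm


# Number of distinct ways to assign 47 ALL and 25 AML labels to 72 patients.
n_assignments = math.comb(47 + 25, 47)

# Simulated stand-in data (NOT the leukemia measurements).
sim = random.Random(1)
values = [sim.gauss(0.30, 0.26) for _ in range(47)] + [sim.gauss(0.53, 0.31) for _ in range(25)]
labels = ["ALL"] * 47 + ["AML"] * 25
obs, p_one, p_two = permutation_test(values, labels, n_perm=2000)
print(f"{n_assignments:.2e} possible labelings")
print(f"diff = {obs:.3f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

Because the values are simulated, the printed difference and p-values will not match the ones reported for gene 48.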

options(repr.plot.width=6, repr.plot.height=3)
tibble(rs_dist) %>% 
  ggplot(aes(rs_dist)) + geom_histogram(colour = "black", fill = "grey", bins = 25) +
  geom_vline(xintercept = -0.2269164, colour = "red") +
  ggtitle("Permutation Null Distribution of Mean Difference") + xlab("Mean Difference") +
  geom_text(aes(-0.2269164, 0, label = "-0.23", hjust = 1, vjust = 1.2), size = 3)


tibble(rs_dist) %>% summarise(pvalu = sum(rs_dist <= -0.2269164)/Numsim)

Exercise #5¶

Use R or Python to create the permutation distribution. Calculate the two-sided P-value. Briefly interpret the two-sided P-value.

ALL	ALL_1	ALL_2	ALL_3	ALL_4	ALL_5	ALL_6	ALL_7	ALL_8	ALL_9	⋯	AML_15	AML_16	AML_17	AML_18	AML_19	AML_20	AML_21	AML_22	AML_23	AML_24
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	⋯	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
-1.5336217	-0.8676096	-0.4331719	-1.6719032	-1.1876894	-1.1272336	-1.0454091	-0.1069170	-1.1987957	-1.1908987	⋯	-0.43665029	-1.27470781	-0.68145848	-0.8766099	-0.6240218	-0.4316276	-1.4352588	-0.6719538	-1.0131609	-0.9694822
-1.2356729	-1.2755005	-1.1844922	-1.5964240	-1.3352557	-1.1137304	-0.8008802	-0.7451766	-0.8493121	-1.1908987	⋯	-0.91548270	-1.35436266	-0.65355897	-1.0962496	-1.0665942	-1.3352557	-1.2045864	-0.7514571	-0.8895921	-1.0809876
-0.3339829	0.3759265	-0.4591960	-1.4225714	-0.7974929	-1.3627683	-0.6719538	-1.1756743	0.3208134	0.6466095	⋯	-0.73615632	-0.02215325	-0.03745523	-0.5673348	-1.1007494	-0.5529381	-0.9488738	-0.2316568	-0.7421631	-0.7795000
0.4887021	0.4440110	0.4362635	0.1933529	0.2356315	-0.3603116	0.1849409	0.4256533	0.3339829	0.2352700	⋯	0.08378111	0.35656224	0.41624076	0.5339862	0.2275053	0.4168160	0.4082022	0.3265565	0.3618128	0.2988635
-1.3008933	-1.2296598	-1.3258824	-1.8183288	-1.3112060	-1.5139747	-1.6516236	-1.3395553	-0.5931317	0.1333023	⋯	-1.54744405	-1.26447453	-1.51231763	-1.4695825	-1.2834722	-0.9776716	-1.0901780	-1.5451198	-1.1742718	-1.4431827
-1.6826682	-1.6420718	-1.4072639	-1.7444693	-1.6543805	-1.7776193	-1.8056201	-1.7078182	-1.7033012	-1.0466244	⋯	-1.70180325	-1.83891357	-1.58768329	-1.7493204	-1.6826682	-1.4706185	-1.4858212	-1.7616315	-1.5790622	-1.6366803

estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<chr>
0.3038753	0.5307917	-3.304723	0.001501085	70	-0.363863	-0.08996983	Two Sample t-test	two.sided

n	count	percent
<int>	<int>	<dbl>
7128	346	4.85

group	n	mean	sd	diff
<chr>	<int>	<dbl>	<dbl>	<dbl>
ALL	47	0.3038753	0.2574523	NA
AML	25	0.5307917	0.3120515	0.2269164

value	group
<dbl>	<chr>
0.7433214	AML
0.5228681	AML
0.4015234	ALL
-0.1060330	ALL
0.2423247	ALL

A tibble: 1 × 1
pvalu
<dbl>
5e-04