STA2453 - Data Science Methods, Collaboration, and Communication

Class #1

September 10, 2019

Professor Nathan Taback

office: SS6027C

e-mail: nathan.taback@utoronto.ca

office hours: TBD

Course website: https://sta2453.github.io

Course Overview

The primary learning objectives are:

  1. to gain experience using data analysis to extract information.
  2. to gain experience communicating information that arises from a data analysis.

Computational Notebooks

Lots of users love them ...

Ref: https://github.com/parente/nbestimate

But, not everyone ...

What is data analysis?

These quotes are from Tukey's 1962 paper, The Future of Data Analysis.

... data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Data analysis ... take on the characteristics of a science rather than those of mathematics ...

Roger Peng states that

If one were to write down the steps in a data analysis, you might come up with something along these lines of the following list

  • Defining the question
  • Defining the ideal dataset
  • Determining what data you can access
  • Obtaining the data
  • Cleaning the data
  • Exploratory data analysis
  • Statistical prediction/modeling
  • Interpretation of results
  • Challenging of results
  • Synthesis and write up
  • Creating reproducible code

Computing

  • I use R and Python. RStudio and Jupyter notebooks are excellent tools.

  • I expect that all work will be done in a "reproducible" manner.

Data acquisition

  • Experiment
  • Observational study

Data cleaning and wrangling

  • Data cleaning example
In [3]:
# Load the tidyverse quietly; repr controls the plot size in Jupyter.
suppressPackageStartupMessages(library(tidyverse))
library(repr)
options(repr.plot.width=4, repr.plot.height=3)

# Toy data: the recorded ages 120 and 90 stand out from the rest.
agedat <- tibble(age = c(25, 20, 21, 120, 19, 31, 90, 17),
                 sex = c(rep("Male", 4), rep("Female", 4)))
ggplot(agedat, aes(sex, age)) + geom_boxplot()

# Remove age > 100, then plot again.
agedat_clean <- filter(agedat, age <= 100)
ggplot(agedat_clean, aes(sex, age)) + geom_boxplot()

Go to Python notebook for Python version
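
For readers who prefer Python, the cleaning step can be sketched with the standard library alone. This is a minimal illustration, not the course's Python notebook: the names `records` and `MAX_PLAUSIBLE_AGE` are assumptions introduced here to mirror `agedat` and the `age <= 100` rule in the R cell.

```python
# Hypothetical Python mirror of the R cell: same eight ages, same cut-off.
ages = [25, 20, 21, 120, 19, 31, 90, 17]
sexes = ["Male"] * 4 + ["Female"] * 4
records = [{"age": a, "sex": s} for a, s in zip(ages, sexes)]

MAX_PLAUSIBLE_AGE = 100  # assumption: mirrors the age <= 100 rule in the R cell

# Keep rows with age <= 100, as filter(agedat, age <= 100) does in R.
records_clean = [r for r in records if r["age"] <= MAX_PLAUSIBLE_AGE]
# Keep the dropped rows too, so the cleaning decision can be inspected.
removed = [r for r in records if r["age"] > MAX_PLAUSIBLE_AGE]

print(len(records), "rows before cleaning,", len(records_clean), "after")
print("removed:", removed)
```

Listing the removed rows keeps the cleaning decision visible, which fits the expectation that all work is done in a reproducible manner.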

Data wrangling (transformation, etc.)

Methods

  • We will cover basic statistical (and some machine learning) methods at a very basic level.

  • But I expect that you are open to learning new statistical methods with minimal guidance.

Data analysis - Case study

Genetic Activity for Leukemia Patients

(Ref: CASI, pg. 8 - 10)

72 leukemia patients, 47 with ALL (acute lymphoblastic leukemia) and 25 with AML (acute myeloid leukemia, a worse prognosis), have each had genetic activity measured for a panel of 7,128 genes.

The data are shown below

In [6]:
leuk_dat <- read_csv("https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia_big.csv")
leuk_dat %>% head()
Warning message:
“Duplicated column names deduplicated: 'ALL' => 'ALL_1' [2], 'ALL' => 'ALL_2' [3], 'ALL' => 'ALL_3' [4], 'ALL' => 'ALL_4' [5], 'ALL' => 'ALL_5' [6], 'ALL' => 'ALL_6' [7], 'ALL' => 'ALL_7' [8], 'ALL' => 'ALL_8' [9], 'ALL' => 'ALL_9' [10], 'ALL' => 'ALL_10' [11], 'ALL' => 'ALL_11' [12], 'ALL' => 'ALL_12' [13], 'ALL' => 'ALL_13' [14], 'ALL' => 'ALL_14' [15], 'ALL' => 'ALL_15' [16], 'ALL' => 'ALL_16' [17], 'ALL' => 'ALL_17' [18], 'ALL' => 'ALL_18' [19], 'ALL' => 'ALL_19' [20], 'AML' => 'AML_1' [22], 'AML' => 'AML_2' [23], 'AML' => 'AML_3' [24], 'AML' => 'AML_4' [25], 'AML' => 'AML_5' [26], 'AML' => 'AML_6' [27], 'AML' => 'AML_7' [28], 'AML' => 'AML_8' [29], 'AML' => 'AML_9' [30], 'AML' => 'AML_10' [31], 'AML' => 'AML_11' [32], 'AML' => 'AML_12' [33], 'AML' => 'AML_13' [34], 'ALL' => 'ALL_20' [35], 'ALL' => 'ALL_21' [36], 'ALL' => 'ALL_22' [37], 'ALL' => 'ALL_23' [38], 'ALL' => 'ALL_24' [39], 'ALL' => 'ALL_25' [40], 'ALL' => 'ALL_26' [41], 'ALL' => 'ALL_27' [42], 'ALL' => 'ALL_28' [43], 'ALL' => 'ALL_29' [44], 'ALL' => 'ALL_30' [45], 'ALL' => 'ALL_31' [46], 'ALL' => 'ALL_32' [47], 'ALL' => 'ALL_33' [48], 'ALL' => 'ALL_34' [49], 'ALL' => 'ALL_35' [50], 'ALL' => 'ALL_36' [51], 'ALL' => 'ALL_37' [52], 'ALL' => 'ALL_38' [53], 'ALL' => 'ALL_39' [54], 'ALL' => 'ALL_40' [55], 'ALL' => 'ALL_41' [56], 'ALL' => 'ALL_42' [57], 'ALL' => 'ALL_43' [58], 'ALL' => 'ALL_44' [59], 'ALL' => 'ALL_45' [60], 'ALL' => 'ALL_46' [61], 'AML' => 'AML_14' [62], 'AML' => 'AML_15' [63], 'AML' => 'AML_16' [64], 'AML' => 'AML_17' [65], 'AML' => 'AML_18' [66], 'AML' => 'AML_19' [67], 'AML' => 'AML_20' [68], 'AML' => 'AML_21' [69], 'AML' => 'AML_22' [70], 'AML' => 'AML_23' [71], 'AML' => 'AML_24' [72]”Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
A tibble: 6 × 72 (first 10 and last 10 columns shown)
       ALL       ALL_1       ALL_2       ALL_3       ALL_4       ALL_5       ALL_6       ALL_7       ALL_8       ALL_9  ...       AML_15       AML_16       AML_17      AML_18      AML_19      AML_20      AML_21      AML_22      AML_23      AML_24
     <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>  ...        <dbl>        <dbl>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
-1.5336217  -0.8676096  -0.4331719  -1.6719032  -1.1876894  -1.1272336  -1.0454091  -0.1069170  -1.1987957  -1.1908987  ...  -0.43665029  -1.27470781  -0.68145848  -0.8766099  -0.6240218  -0.4316276  -1.4352588  -0.6719538  -1.0131609  -0.9694822
-1.2356729  -1.2755005  -1.1844922  -1.5964240  -1.3352557  -1.1137304  -0.8008802  -0.7451766  -0.8493121  -1.1908987  ...  -0.91548270  -1.35436266  -0.65355897  -1.0962496  -1.0665942  -1.3352557  -1.2045864  -0.7514571  -0.8895921  -1.0809876
-0.3339829   0.3759265  -0.4591960  -1.4225714  -0.7974929  -1.3627683  -0.6719538  -1.1756743   0.3208134   0.6466095  ...  -0.73615632  -0.02215325  -0.03745523  -0.5673348  -1.1007494  -0.5529381  -0.9488738  -0.2316568  -0.7421631  -0.7795000
 0.4887021   0.4440110   0.4362635   0.1933529   0.2356315  -0.3603116   0.1849409   0.4256533   0.3339829   0.2352700  ...   0.08378111   0.35656224   0.41624076   0.5339862   0.2275053   0.4168160   0.4082022   0.3265565   0.3618128   0.2988635
-1.3008933  -1.2296598  -1.3258824  -1.8183288  -1.3112060  -1.5139747  -1.6516236  -1.3395553  -0.5931317   0.1333023  ...  -1.54744405  -1.26447453  -1.51231763  -1.4695825  -1.2834722  -0.9776716  -1.0901780  -1.5451198  -1.1742718  -1.4431827
-1.6826682  -1.6420718  -1.4072639  -1.7444693  -1.6543805  -1.7776193  -1.8056201  -1.7078182  -1.7033012  -1.0466244  ...  -1.70180325  -1.83891357  -1.58768329  -1.7493204  -1.6826682  -1.4706185  -1.4858212  -1.7616315  -1.5790622  -1.6366803

The histograms and boxplots below show the distributions for gene 48 in the ALL and AML groups.

Exercise #1

Create the histogram and boxplots above using R or Python. Interpret each plot. Which plot do you prefer to compare the distributions? Why?

Is there a difference in gene 48 expression between patients with ALL (acute lymphoblastic leukemia) and patients with AML (acute myeloid leukemia, a worse prognosis)?

A tibble: 1 × 9
estimate1  estimate2  statistic  p.value      parameter  conf.low   conf.high    method             alternative
<dbl>      <dbl>      <dbl>      <dbl>        <dbl>      <dbl>      <dbl>        <chr>              <chr>
0.3038753  0.5307917  -3.304723  0.001501085  70         -0.363863  -0.08996983  Two Sample t-test  two.sided
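
The test statistic in this output can be checked by hand from group summaries. The sketch below is a Python illustration under one assumption: that the test is the equal-variance (pooled) two-sample t-test, which is consistent with the reported 70 degrees of freedom. The group sizes, means, and standard deviations are the ones reported in the group summary in the Hypothesis Testing section of this notebook; the variable names are illustrative.

```python
import math

# Group summaries (n, mean, sd) as reported in this notebook's summary table.
n_all, mean_all, sd_all = 47, 0.3038753, 0.2574523
n_aml, mean_aml, sd_aml = 25, 0.5307917, 0.3120515

# Assumption: pooled-variance t-test, so df = n1 + n2 - 2.
df = n_all + n_aml - 2
pooled_var = ((n_all - 1) * sd_all**2 + (n_aml - 1) * sd_aml**2) / df
std_err = math.sqrt(pooled_var * (1 / n_all + 1 / n_aml))

# Difference is ALL minus AML, matching estimate1 - estimate2 in the output.
t_stat = (mean_all - mean_aml) / std_err

print(f"df = {df}, t = {t_stat:.3f}")
```

Turning the statistic into a p-value needs the t distribution, which Python's standard library does not provide; a statistics package would be used for that step.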

Exercise #2

Use R or Python to calculate a p-value for a test comparing the two groups. Interpret the p-value.

This is one candidate gene out of 7,128 genes. Is $t=-3.30$ unusual?

A tibble: 1 × 3
n      count  percent
<int>  <int>  <dbl>
7128   346    4.85
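
Whether one small p-value is unusual among 7,128 tests can be framed with a rough expected count. The sketch below is a Python illustration, not part of the original analysis. It assumes that if no gene differed between the groups, each gene's p-value would be uniformly distributed on [0, 1]. The variable names are illustrative.

```python
n_genes = 7128          # genes on the panel
p_gene48 = 0.001501085  # two-sided p-value reported for gene 48

# Assumption: under "no gene differs", each p-value is uniform on [0, 1],
# so the expected number of genes with a p-value at or below p is n * p.
expected_by_chance = n_genes * p_gene48

print(f"expected by chance alone: {expected_by_chance:.1f} genes")
```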

Exercise #3

Create the histogram above using R or Python. Write a few sentences describing how the histogram above was generated. Is gene 48 unusual compared to other genes?

Statistical Thinking and Hypothesis Testing

  • The general ideas behind testing hypotheses permeate "statistical thinking".
  • The core of statistical thinking includes the appreciation of uncertainty and variability.
  • Hospital A has 20 births per day and Hospital B has 50 births per day. During a period of one year, which hospital is more likely to have more than 60% boys born?
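
The hospital question can be made concrete with an exact binomial calculation. The sketch below is a Python illustration under two assumptions that are not stated in the text: each birth is a boy with probability 0.5, independently of the others, and the quantity of interest is the chance that more than 60% of births on a single day are boys. The function name is hypothetical.

```python
from math import comb

def prob_more_than_60pct_boys(n_births, p_boy=0.5):
    """Exact P(more than 60% of n_births are boys) on a single day."""
    # 10 * k > 6 * n_births is the integer form of k / n_births > 0.6.
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy) ** (n_births - k)
        for k in range(n_births + 1)
        if 10 * k > 6 * n_births
    )

p_hospital_a = prob_more_than_60pct_boys(20)  # Hospital A: 20 births per day
p_hospital_b = prob_more_than_60pct_boys(50)  # Hospital B: 50 births per day

print(f"Hospital A: {p_hospital_a:.4f}  Hospital B: {p_hospital_b:.4f}")
```

The smaller hospital gets the larger probability because a proportion based on fewer births varies more from day to day.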

Exercise #4

During a period of two years, which hospital is more likely to have more than 60% boys born? Use R or Python to create similar histograms that show the distributions of boys in a large and small hospital.

Hypothesis Testing

  • Let's look at the expression of gene 48 between the two groups, ALL and AML.
  • Assume that there is no difference in gene expression between the two groups. This means that the same gene expression value for a patient would occur if that patient is in the ALL or AML groups.
In [15]:
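# gene_48 (columns value and group) is assumed to come from an earlier cell.
set.seed(100)
gene_48 %>% select(value, group) %>% sample_n(5)  # five randomly chosen patients

# Group sizes, means and SDs; diff is the AML mean minus the ALL mean.
gene_48 %>% group_by(group) %>% 
  summarise(n = n(), mean = mean(value), sd = sd(value))  %>%
  mutate(diff = mean - lag(mean))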
A tibble: 5 × 2
value       group
<dbl>       <chr>
 0.7433214  AML
 0.5228681  AML
 0.4015234  ALL
-0.1060330  ALL
 0.2423247  ALL
A tibble: 2 × 5
group  n      mean       sd         diff
<chr>  <int>  <dbl>      <dbl>      <dbl>
ALL    47     0.3038753  0.2574523  NA
AML    25     0.5307917  0.3120515  0.2269164
  • Under the assumption of no difference, how likely is it that we would observe a mean difference (ALL minus AML) at least as extreme as -0.23 (the observed difference, corresponding to $t=-3.30$)?
  • Under the assumption of no difference, the group labels AML and ALL are interchangeable. The number of permutations of the labels is the number of ways that the observed data could have been generated under the assumption of no difference.
  • In this case there are ${{47+25} \choose 47} \approx 1.5 \times 10^{19}$ ways to assign the labels.
  • Instead of evaluating all $1.5 \times 10^{19}$ permutations, we can permute the group labels at random to create one resampled data set, and then repeat this, say, 10000 times. (see pages 49-51, CASI)
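
The permutation idea in the bullets above can be sketched in Python with the standard library. This is a minimal illustration, not the code behind the notebook's results: the function name, the seed, and the toy values are hypothetical, and the toy values are NOT the gene 48 measurements. It reports a one-sided p-value, the share of permuted differences at or below the observed one.

```python
import math
import random

# Number of ways to choose which 47 of the 72 patients carry the ALL label.
n_labelings = math.comb(47 + 25, 47)
print(f"{n_labelings:.2e} possible labelings")

def permutation_pvalue(group_a, group_b, n_perm=10000, seed=1):
    """One-sided Monte Carlo p-value for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    at_or_below = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling values is equivalent to shuffling labels
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Small tolerance guards against floating-point ties.
        if diff <= observed + 1e-12:
            at_or_below += 1
    return observed, at_or_below / n_perm

# Hypothetical toy values, NOT the gene 48 data.
toy_all = [0.10, 0.25, 0.30, 0.35, 0.40, 0.20]
toy_aml = [0.55, 0.60, 0.45, 0.70]
obs_diff, p_one_sided = permutation_pvalue(toy_all, toy_aml)
print(f"observed difference = {obs_diff:.3f}, one-sided p = {p_one_sided:.4f}")
```

Fixing the seed makes the Monte Carlo p-value repeatable, and increasing `n_perm` reduces its simulation error.
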
In [19]:
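# rs_dist (permuted mean differences) and Numsim (number of permutations)
# are assumed to have been created in an earlier cell.
options(repr.plot.width=6, repr.plot.height=3)
tibble(rs_dist) %>% 
  ggplot(aes(rs_dist)) + geom_histogram(colour = "black", fill = "grey", bins = 25) +
  geom_vline(xintercept = -0.2269164, colour = "red") +  # observed ALL - AML difference
  ggtitle("Permutation Null Distribution of Mean Difference") + xlab("Mean Difference") +
  annotate("text", x = -0.2269164, y = 0, label = "-0.23", hjust = 1, vjust = 1.2, size = 3)

# One-sided p-value: share of permuted differences at or below the observed one.
tibble(rs_dist) %>% summarise(pvalu = sum(rs_dist <= -0.2269164)/Numsim)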
A tibble: 1 × 1
pvalu
<dbl>
5e-04

Exercise #5

Use R or Python to create the permutation distribution. Calculate the two-sided P-value. Briefly interpret the two-sided P-value.