Regression Basics Part 2

Adam J Sullivan
Assistant Professor of Biostatistics
Brown University

One Continuous


One Continuous Covariate


  • We will now consider a single continuous covariate: year.

Example: Year and Appearances


  • Consider the effect of year on appearances.
  • With categorical data we plotted this with box-and-whisker plots.
  • With a continuous covariate we instead use a scatter plot.

Scatter Plot: Year and Appearances


library(ggplot2)
library(fivethirtyeight)   # contains the comic_characters data

ggplot(comic_characters, aes(year, appearances)) + 
  geom_point() + 
  geom_smooth(method="lm") + 
  xlab("Year") + 
  ylab("Appearances")

Scatter Plot: Year and Appearances


[Figure: scatter plot of year versus appearances with a fitted least-squares line]

Modeling What We See


  • There might be no connection, or a very small one; let's explore further.
  • How can we do this?
  • How does linear regression work?

How do we Quantify this?


  • One way we could quantify this is \[\mu_{y|x} = \beta_0 + \beta_1X\]
  • where
    • \(\mu_{y|x}\) is the mean number of appearances for those whose year is \(x\).
    • \(\beta_0\) is the \(y\)-intercept (the mean value of \(y\) when \(x=0\), \(\mu_{y|0}\))
    • \(\beta_1\) is the slope (the change in the mean value of \(Y\) corresponding to a 1-unit increase in \(x\)).

Population Regression Line


  • With the population regression line we have that the distribution of appearances for those at a particular year, \(x\), is approximately normal with mean, \(\mu_{y|x}\), and standard deviation, \(\sigma_{y|x}\).

Population Regression Line


[Figure: distribution of Y at different levels of X]

Population Regression Line


  • This shows the scatter about the mean due to natural variation. To accommodate this scatter we fit a regression model with two parts:
    • Systematic Part
    • Random Part

The Model


  • This leads to the model \[Y = \beta_0 + \beta_1X + \varepsilon\]
  • where \(\beta_0+\beta_1X\) is the systematic part of the model, implying that \[E(Y|X=x) = \mu_{y|x} = \beta_0 + \beta_1x\]
  • and \(\varepsilon\) is the random part, with \(\varepsilon\sim N(0,\sigma^2)\) independent of \(X\).
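
A quick way to see the two parts is to simulate from the model and refit it; a minimal sketch with made-up values of \(\beta_0\), \(\beta_1\), and \(\sigma\):

set.seed(42)
n <- 100
beta0 <- 10; beta1 <- 2; sigma <- 3          # made-up parameters
x <- runif(n, 0, 10)
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)  # systematic part + random part
coef(lm(y ~ x))                              # estimates land close to beta0 and beta1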

What do We Have?


  • Consider the scenario where we have \(n\) subjects, and for each subject we have a data point \((x,y)\).
  • This leads to data of the form \((X_i,Y_i)\) for \(i=1,\ldots,n\).
  • Then we have the model: \[Y_i = \beta_0 + \beta_1X_i + \varepsilon_i\]
  • \(Y_i|X_i \sim N\left(\beta_0 + \beta_1X_i , \sigma^2\right)\)
  • \(E(Y_i|X_i) = \mu_{y|x} = \beta_0 + \beta_1X_i\)
  • \(Var(Y_i|X_i) = \sigma^2\)

Picture of this


[Figure: illustration of the normal distribution of Y about the regression line]

What Does This Tell Us?


  • We can now refer back to our scatter plot and discuss what the "best" line is.
  • Given the previous image, a good estimator would somehow make the residual errors small.
  • So the "best" line is the one that minimizes the errors.

Residual Errors


[Figure: residual errors about the fitted line]

In Comes Least Squares


  • The least squares estimator of the regression coefficients is the estimator that minimizes the sum of squared errors.
  • We denote these estimators as \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  • In other words, we minimize \[\sum_{i=1}^n \left(\varepsilon_i\right)^2 = \sum_{i=1}^n \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i\right)^2\]
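
The minimizers have closed forms: \(\hat{\beta}_1 = \sum_i (X_i-\bar{X})(Y_i-\bar{Y}) / \sum_i (X_i-\bar{X})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\). A minimal sketch verifying this against lm() on the comic data (assuming we drop missing values first):

dat <- na.omit(comic_characters[, c("year", "appearances")])  # complete cases only
x <- dat$year; y <- dat$appearances
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)  # matches coef(lm(appearances ~ year, data=dat))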

Inferences on OLS


  • Once we have our intercept and slope estimates, the next step is to determine whether they are significant.
  • Typically, hypothesis testing requires the following:
    • Population/assumed value of interest
    • Estimated value
    • Standard error of the estimate

Confidence Interval Creation


  • The 95% confidence intervals are \[\hat{\beta}_1 \pm t_{n-2, 0.975} \cdot se\left(\hat{\beta}_1\right)\] \[\hat{\beta}_0 \pm t_{n-2, 0.975} \cdot se\left(\hat{\beta}_0\right)\]
  • In general we can find a \(100(1-\alpha)\%\) confidence interval as \[\hat{\beta}_1 \pm t_{n-2, 1-\dfrac{\alpha}{2}} \cdot se\left(\hat{\beta}_1\right)\] \[\hat{\beta}_0 \pm t_{n-2, 1-\dfrac{\alpha}{2}} \cdot se\left(\hat{\beta}_0\right)\]
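
These are the intervals that confint() and broom's tidy(conf.int=TRUE) report; a minimal sketch computing them by hand from the fitted model:

model <- lm(appearances ~ year, data = comic_characters)
est <- coef(summary(model))                  # estimates and standard errors
tcrit <- qt(0.975, df = df.residual(model))  # t quantile with n-2 df
cbind(lower = est[, "Estimate"] - tcrit * est[, "Std. Error"],
      upper = est[, "Estimate"] + tcrit * est[, "Std. Error"])
# compare with confint(model)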

Example: Year and Appearances


library(broom)

model <- lm(appearances~year, data=comic_characters)
tidy(model, conf.int=TRUE)[,-c(3:4)]
glance(model)

Example: Year and Appearances


## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>   <dbl>  <dbl>
## 1    0.0146        0.0145  93.8      313. 1.74e-69     2 -1.26e5 2.52e5
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>


Interpreting the Coefficients: Continuous


  • Before we can discuss the regression coefficients we need to understand how to interpret them.
  • \(\beta_0\) is the mean value of \(Y\) when \(X=0\).
  • What about \(\beta_1\)?

Interpreting the Coefficients: Continuous


  • We consider \(\beta_0\) first.
  • Does this value have meaning with our current data?
    • The estimated mean at year 0 is an extrapolation; the model only applies to years within the range of our data.
    • Many times the intercept is scientifically meaningless.
    • Even if meaningless on its own, \(\beta_0\) is necessary to specify the equation of our regression line.
    • Note: people sometimes mean-center the data, and then the intercept is interpretable.

Interpreting the Coefficients: Continuous


  • Next we consider \(\beta_1\). To see its meaning we compute \[ \begin{aligned} E(Y|X=x+1) - E(Y|X=x) &= \beta_0 + \beta_1(x+1) - \beta_0 - \beta_1x\\ &= \beta_1 \end{aligned} \]

Interpreting the Coefficients: Continuous


  • This gives us the interpretation that \(\beta_1\) represents the mean change in the outcome \(Y\) for a one-unit increase in the predictor \(X\).
  • This is not a statement about change within a subject; it compares subjects, or groups of subjects, who differ by one unit.
  • Below are correct interpretations of \(\beta_1\) in our example:
    • The mean difference in appearances for a 1-year difference is -0.596.
    • The mean difference in appearances for a 10-year difference is -5.96.


Multiple Regression


Multiple Regression


  • We have been discussing simple models so far.
  • This works well when you have:
    • Randomized data to test between specific groups (treatment vs. control)
  • In most situations we need to look at more than one relationship.
  • Think of this as needing more information to tell the entire story.

Motivating Example


  • Health disparities are very real and exist across individuals and populations.
  • Before developing methods of remedying these disparities, we need to be able to identify where they exist. Here we consider a study by Asch & Armstrong (2007).
  • This paper considers 222 patients with localized prostate cancer.

Motivating Example


  • The table below partitions patients by race, hospital, and whether or not the patient received a prostatectomy.

Hospital              Race    Prostatectomy   No Prostatectomy
University Hospital   White   54              37
University Hospital   Black   7               5
VA Hospital           White   11              29
VA Hospital           Black   22              57

Loading the Data


You can load this data into R with the code below:

phil_disp <- read.table("https://drive.google.com/uc?export=download&id=0B8CsRLdwqzbzOXlIRl9VcjNJRFU", header=TRUE, sep=",")
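
As a quick check that the download matches the table above, the counts can be rebuilt with xtabs():

xtabs(number ~ hospital + race + surgery, data = phil_disp)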

The Data


This dataset contains the following variables:

Variable   Description
hospital   0 = University Hospital, 1 = VA Hospital
race       0 = White, 1 = Black
surgery    0 = No prostatectomy, 1 = Had prostatectomy
number     Count of people in category

Consider Prostatectomy by Race


library(broom)
prost_race <- glm(surgery ~ race, weights=number, data=phil_disp,
                  family="binomial")
tidy(prost_race, exponentiate=T, conf.int=T)[,-c(3:4)]
## # A tibble: 2 x 5
##   term        estimate p.value conf.low conf.high
##   <chr>          <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.985 0.930      0.699     1.39 
## 2 race           0.475 0.00895    0.269     0.825
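
Because race is the only covariate, this estimate is just the unadjusted odds ratio from the table collapsed over hospital; a quick hand check with the counts above:

odds_white <- (54 + 11) / (37 + 29)  # 65/66, the intercept on the odds scale
odds_black <- (7 + 22) / (5 + 57)    # 29/62
odds_black / odds_white              # ~0.475, matching the race estimate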

Consider Prostatectomy by Race


  • What can we conclude?
  • What kind of policy might we want to invoke based on this discovery?

Consider Prostatectomy by Hospital


prost_hosp <- glm(surgery ~ hospital, weights=number, data=phil_disp,
                  family="binomial")
tidy(prost_hosp, exponentiate =T, conf.int=T)[,-c(3:4)]
## # A tibble: 2 x 5
##   term        estimate    p.value conf.low conf.high
##   <chr>          <dbl>      <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.45  0.0627        0.984     2.16 
## 2 hospital       0.264 0.00000341    0.149     0.460

Consider Prostatectomy by Hospital


  • What can we conclude?

Multiple Regression of Prostatectomy


prost <- glm(surgery ~ hospital + race, weights=number, data=phil_disp,
             family="binomial")
tidy(prost, exponentiate=T, conf.int=T)[,-c(3:4)]
## # A tibble: 3 x 5
##   term        estimate  p.value conf.low conf.high
##   <chr>          <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.45  0.0682      0.976     2.18 
## 2 hospital       0.264 0.000124    0.131     0.515
## 3 race           0.998 0.996       0.501     2.04
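
Why did the race effect vanish? Stratifying the original table by hospital shows that within each hospital the odds of surgery are nearly identical for Black and White patients; race appeared to matter only because Black patients were seen mostly at the VA, where surgery was less common. A quick hand check:

(7 / 5) / (54 / 37)    # race odds ratio within University Hospital, ~0.96
(22 / 57) / (11 / 29)  # race odds ratio within the VA Hospital, ~1.02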

Multiple Regression of Prostatectomy


  • What can we conclude?
  • What happened here?
  • Does this change our policy suggestion from before?

Benefits of Multiple Regression


  • Multiple regression helps us tell a more complete story.
  • Multiple regression controls for confounding.

Confounding


  • A confounder is associated with both the exposure and the outcome.
  • Even if the exposure and outcome are not related, an unmeasured confounder can make them appear to be.

What Do We Do with Confounding?


  • We must include all confounders in our model.
  • Without adjusting for confounders, our results may be highly biased.
  • Without adjusting for confounding, we may make incorrect policies that do not fix the problem.

Multiple Linear Regression with appearances


  • First, fit univariate models.
  • Then fit the multiple regression model.

Multivariate Models


library(broom)
library(fivethirtyeight)
mod3 <- lm(appearances~publisher + year, data=comic_characters)
tidy3 <- tidy(mod3, conf.int=T)[,-c(3:4)]
tidy3
## # A tibble: 3 x 5
##   term             estimate  p.value  conf.low conf.high
##   <chr>               <dbl>    <dbl>     <dbl>     <dbl>
## 1 (Intercept)      1265.    9.81e-78  1133.     1398.   
## 2 publisherMarvel    -9.54  1.24e-11   -12.3      -6.78 
## 3 year               -0.624 5.93e-75    -0.690    -0.557

Interpreting Multiple Coefficients


  • The intercept is the mean outcome when all covariates are zero.
  • Every other coefficient is interpreted holding the remaining covariates fixed.

Interpreting Multiple Coefficients: Our Example


  • Intercept: the average appearances for a DC character at year 0. Not meaningful.
  • Publisher coefficient: comparing two characters in the same year, the character from Marvel has on average 9.54 fewer appearances than the character from DC.
  • Year coefficient: comparing two characters from the same publisher, a 1-year increase is associated with on average 0.62 fewer appearances.

Further Example: Bike Sharing Data


  • We have hourly data spanning 2 years.
  • This dataset contains the first 19 days of each month.
  • The goal is to model the total count of bikes rented.

Further Example: Bike Sharing Data


Data Fields

Field        Description
datetime     hourly date + timestamp
season       1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday      whether the day is considered a holiday
workingday   whether the day is neither a weekend nor a holiday

Further Example: Bike Sharing Data


Data Fields

Field     Description
weather   1 = Clear, Few clouds, Partly cloudy
          2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
          3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
          4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp      temperature in Celsius

Further Example: Bike Sharing Data


Data Fields

Field        Description
atemp        "feels like" temperature in Celsius
humidity     relative humidity
windspeed    wind speed
casual       number of rentals initiated by non-registered users
registered   number of rentals initiated by registered users
count        total number of rentals

Univariate Regressions


library(readr)
library(tidyverse)
bikes <- read_csv("../Notes/Data/bike_sharing.csv") %>%
        mutate(season = as.factor(season)) %>%
        mutate(weather=as.factor(weather)) 

Univariate Regressions


mod1 <- lm(count~season, data=bikes)
mod2 <- lm(count~holiday, data=bikes)
mod3 <- lm(count~workingday, data=bikes)
mod4 <- lm(count~weather, data=bikes)
mod5 <- lm(count~temp, data=bikes)
mod6 <- lm(count~atemp, data=bikes)
mod7 <- lm(count~humidity, data=bikes)
mod8 <- lm(count~windspeed, data=bikes)
mod9 <- lm(count~casual, data=bikes)
mod10 <- lm(count~registered, data=bikes)

Univariate Regressions


library(broom)
tidy1 <- tidy(mod1, conf.int=T)[-1, -c(3:4)]
tidy2 <- tidy(mod2, conf.int=T)[-1, -c(3:4)]
tidy3 <- tidy(mod3, conf.int=T)[-1, -c(3:4)]
tidy4 <- tidy(mod4, conf.int=T)[-1, -c(3:4)]
tidy5 <- tidy(mod5, conf.int=T)[-1, -c(3:4)]
tidy6 <- tidy(mod6, conf.int=T)[-1, -c(3:4)]
tidy7 <- tidy(mod7, conf.int=T)[-1, -c(3:4)]
tidy8 <- tidy(mod8, conf.int=T)[-1, -c(3:4)]
tidy9 <- tidy(mod9, conf.int=T)[-1, -c(3:4)]
tidy10 <- tidy(mod10, conf.int=T)[-1, -c(3:4)]
bind_rows(tidy1, tidy2, tidy3, tidy4, tidy5, tidy6, tidy7, tidy8, tidy9, tidy10)
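
The two code slides above can also be collapsed into a single loop; a compact alternative sketch using purrr (same models and output, less repetition):

library(purrr)

predictors <- c("season", "holiday", "workingday", "weather", "temp",
                "atemp", "humidity", "windspeed", "casual", "registered")

# fit count ~ predictor for each variable, tidy it, and stack the results
map_dfr(predictors, function(p) {
  fit <- lm(reformulate(p, response = "count"), data = bikes)
  tidy(fit, conf.int = TRUE)[-1, -c(3:4)]  # drop intercept, std.error, statistic
})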

Univariate Regressions


## # A tibble: 14 x 5
##    term       estimate   p.value conf.low conf.high
##    <chr>         <dbl>     <dbl>    <dbl>     <dbl>
##  1 season2       98.9  9.76e- 94    89.6     108.  
##  2 season3      118.   1.06e-131   109.      127.  
##  3 season4       82.6  2.13e- 66    73.3      92.0 
##  4 holiday       -5.86 5.74e-  1   -26.3      14.6 
##  5 workingday     4.51 2.26e-  1    -2.80     11.8 
##  6 weather2     -26.3  4.32e- 11   -34.1     -18.5 
##  7 weather3     -86.4  3.29e- 40   -99.1     -73.7 
##  8 weather4     -41.2  8.18e-  1  -393.      311.  
##  9 temp           9.17 0.            8.77      9.57
## 10 atemp          8.33 0.            7.96      8.70
## 11 humidity      -2.99 2.92e-253    -3.15     -2.82
## 12 windspeed      2.25 2.90e- 26     1.83      2.66
## 13 casual         2.50 0.            2.45      2.55
## 14 registered     1.16 0.            1.16      1.17

Multivariate


mod.final <- lm(count~season+weather+humidity+windspeed, data=bikes)
tidy(mod.final)[-1,-c(3:4)]
glance(mod.final)

Multivariate


## # A tibble: 8 x 3
##   term      estimate   p.value
##   <chr>        <dbl>     <dbl>
## 1 season2    116.    1.40e-145
## 2 season3    148.    7.52e-227
## 3 season4    118.    1.74e-147
## 4 weather2    20.0   1.38e-  7
## 5 weather3     0.124 9.84e-  1
## 6 weather4   162.    3.19e-  1
## 7 humidity    -3.49  3.86e-273
## 8 windspeed    0.633 2.05e-  3

Multivariate


## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.195         0.194  163.      329.       0     9 -70865. 1.42e5
## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
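
Finally, the fitted model can be used for prediction with predict(); a sketch for a hypothetical hour (the covariate values below are made up for illustration):

new_hour <- data.frame(season = factor("3", levels = levels(bikes$season)),
                       weather = factor("1", levels = levels(bikes$weather)),
                       humidity = 60, windspeed = 10)
predict(mod.final, newdata = new_hour, interval = "confidence")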