We will begin with an example of simple logistic regression with a sample of the Western Collaborative Group Study. This study began in 1960 with 3154 men ages 39-59, who were employed in one of 11 California based companies. They were followed until 1969 during this time, 257 of these men developed coronary heart disease (CHD). You can read this data in with the code below. Lab 4 Markdown
library(haven)
wcgs <- read_dta("wcgs2.dta")
wcgs <- wcgs[, -16]
Name | Description |
---|---|
id | Subject identification number |
age | Age in years |
height | Height in inches |
weight | Weight in lbs. |
sbp | Systolic blood pressure in mm |
dbp | Diastolic blood pressure in mm Hg |
chol | Fasting serum cholesterol in mm |
behpat | Behavior |
1 = A1 | |
2 = A2 | |
3 = B3 | |
4 = B4 | |
ncigs | Cigarettes per day |
dibpat | Behavior |
1 = type A | |
2 = type B | |
chd69 | Coronary heart disease |
1 = Yes | |
0 = no | |
typechd | Type of CHD |
1 = myocardial infarction or death | |
2 = silent myocardial infarction | |
3 = angina perctoris | |
time169 | Time of CHD event or end of follow-up |
arcus | Arcus senilis |
0 = absent | |
1 = present | |
bmi | Body Mass Index |
For continuous covariates, we can explore their relationship with CHD by using boxplots. For categorical covariates, we can create contingency tables by table
function, in order to explore their relationship with CHD.
numeric <- c("age","height","weight","sbp","dbp","chol","ncigs","bmi","time169")
knitr::kable(summary(wcgs[numeric]))
category <- c("behpat","dibpat","chd69","typchd69","arcus")
apply(wcgs[category],2,table)
#plot for numerical variables vs. CHD
library(ggplot2)
par(mfrow = c(3,3))
for (var in numeric) {
boxplot(as.formula(paste0(var,"~chd69")), data = wcgs, xlab = "chd", ylab = var)
}
#tables for categorical variable vs. CHD
for (var in category) {
print(table(wcgs[c("chd69",var)]))
}
## behpat
## chd69 1 2 3 4
## 0 234 1177 1155 331
## 1 30 148 61 18
## dibpat
## chd69 0 1
## 0 1486 1411
## 1 79 178
## chd69
## chd69 0 1
## 0 2897 0
## 1 0 257
## typchd69
## chd69 0 1 2 3
## 0 2897 0 0 0
## 1 0 135 71 51
## arcus
## chd69 0 1
## 0 2058 839
## 1 153 102
age
bmi
sbp
dbp
behpat
ncigs
dibpat
arcus
#turn categorical variables into factors
library(tidyverse)
wcgs<-wcgs %>% mutate(behpat=factor(behpat,levels = c(1,2,3,4),labels=c("A1","A2","B3","B4"))) %>%
mutate(dibpat=factor(dibpat,levels=c(1,0),labels=c("A","B")))%>%
mutate(chd69=as.factor(chd69))%>%
mutate(typchd69=as.factor(typchd69))%>%
mutate(arcus=as.factor(arcus))
library(broom)
result <- c()
for (var in c("age","bmi","sbp","dbp","behpat","ncigs","dibpat","arcus")) {
f <- as.formula(paste0("chd69~", var))
m <- glm(f, data = wcgs, family = binomial(link = "logit"))
r <- tidy(m, exponentiate = T, conf.int = T)[-1,-c(3,4)]
result <- rbind(result,r)
}
result
## term estimate p.value conf.low conf.high
## 2 age 1.0772619 4.557900e-11 1.0536639 1.1014329
## 21 bmi 1.0879302 4.746882e-04 1.0372496 1.1401466
## 22 sbp 1.0270301 3.727669e-13 1.0196129 1.0344164
## 23 dbp 1.0341298 2.007131e-08 1.0219629 1.0462403
## 24 behpatA2 0.9807986 9.273491e-01 0.6553361 1.5123703
## 3 behpatB3 0.4119481 1.529395e-04 0.2624012 0.6593955
## 4 behpatB4 0.4241692 5.685465e-03 0.2270284 0.7715971
## 25 ncigs 1.0234915 9.223399e-09 1.0153544 1.0315901
## 26 dibpatB 0.4214201 7.118661e-10 0.3186990 0.5525801
## 27 arcus1 1.6352801 2.483353e-04 1.2542942 2.1240618
The coefficient interpretation is different for continuous covariate and categorical covariate. For a continuouse covariate, its exponentiated coefficient can be interpreted in following ways:
Take age
as an example:
\(\cdot\) If there are 2 people who differ in age by one year the older person would on average have an odds of CHD 1.077 times that of the younger.
\(\cdot\) On average a person has an odds of CHD 0.077 higher than someone a year younger.
\(\cdot\) On average a person has an odds of CHD 7.726% higher than someone a year younger.
For a categorical covariate, its exponentiated coefficeint can be interpreted in following ways:
Take arcus
as an example:
\(\cdot\) On average a person with Arcus Senilis present has an odds of CHD 1.635 times that of the odds of CHD for another person with absence of Arcus Senilis.
\(\cdot\) On average a person with Arcus Senilis present has an odds of CHD 63.528% higher than the odds of CHD for another person with absence of Arcus Senilis.
table(wcgs$dibpat,wcgs$behpat)
m1<-glm(chd69 ~ age+bmi+sbp+dbp+behpat+dibpat+ncigs+arcus, data = wcgs, family = binomial(link = "logit"))
#m2<-glm(chd69 ~ age+bmi+sbp+dbp+dibpat+behpat+ncigs+arcus, data = wcgs, family = binomial(link = "logit"))
tidy(m1, exponentiate = T, conf.int = T)[,-c(3,4)]
#tidy(m2, exponentiate = T, conf.int = T)
##
## A1 A2 B3 B4
## A 264 1325 0 0
## B 0 0 1216 349
## term estimate p.value conf.low conf.high
## 1 (Intercept) 6.644406e-05 1.584605e-24 1.035371e-05 0.000415262
## 2 age 1.060567e+00 1.170574e-06 1.035709e+00 1.086046836
## 3 bmi 1.068028e+00 1.151964e-02 1.014586e+00 1.123732275
## 4 sbp 1.019573e+00 2.095482e-03 1.006924e+00 1.032132283
## 5 dbp 1.000659e+00 9.498950e-01 9.802821e-01 1.021435331
## 6 behpatA2 1.039891e+00 8.585570e-01 6.851928e-01 1.624144766
## 7 behpatB3 5.039283e-01 4.609561e-03 3.161822e-01 0.818601770
## 8 behpatB4 5.664717e-01 7.351574e-02 2.991780e-01 1.046538828
## 9 ncigs 1.022909e+00 6.299268e-08 1.014498e+00 1.031305046
## 10 arcus1 1.378632e+00 2.282318e-02 1.043367e+00 1.814703356
a. Comment on the changes you see in coefficients from the univariate to multiple regression model.
From the output of m1, we can see there’s no coefficent for dibpat
. This is due to behpat
and dibpat
are correlated, in the sense if behpat is A1 or A2, then dibpat must be A. Therefore, bahpat
contains the information of dibpat
. For other covariates, we can see the predictor sbp
is no longer statistically significant in the mutiple regression model. The coefficeint for behpatA2
increased from 0.98 to 1.03, the coefficient for arcus
decreased from 1.63 to 1.37.
b. Pick the 3 most significant coefficients and interpret them.
Based on p-values, we choose age
, ncigs
and `sbp`` as 3 most significant coefficients.
Take age
as an example: On average a person has an odds of CHD 0.060 higher than someone a year younger, conditioning on all other covariates.
For prediction, we need to first create a new data entry, then use predict
function with the new dataset to predict the log odds.
new_input <- data.frame(age = 65,bmi = 36, sbp = 136,dbp = 78,behpat = "A2",ncigs = 3,arcus = factor(0), dibpat = "B")
log_odds <- predict(m1,newdata=new_input)
log_odds
## 1
## -0.6328796
p <- exp(log_odds)/exp(1+log_odds)
p
## 1
## 0.3678794