The Data

We will begin with an example of simple logistic regression using a sample from the Western Collaborative Group Study. This study began in 1960 with 3154 men aged 39-59 who were employed at one of 11 California-based companies. They were followed until 1969; during this time, 257 of these men developed coronary heart disease (CHD). You can read the data in with the code below.

Reading in the Data

library(haven)
wcgs <- read_dta("wcgs2.dta")
wcgs <- wcgs[, -16]   # drop the 16th column
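
As a quick check (a small sketch, assuming the file was read in cleanly), we can confirm that the data contain the 3154 subjects and 257 CHD cases described above:

dim(wcgs)           # expect 3154 rows
table(wcgs$chd69)   # expect 257 men with chd69 = 1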

The Variables

Name      Description
id        Subject identification number
age       Age in years
height    Height in inches
weight    Weight in lbs.
sbp       Systolic blood pressure in mm Hg
dbp       Diastolic blood pressure in mm Hg
chol      Fasting serum cholesterol
behpat    Behavior pattern: 1 = A1, 2 = A2, 3 = B3, 4 = B4
ncigs     Cigarettes per day
dibpat    Behavior pattern (dichotomized): 1 = type A, 0 = type B
chd69     Coronary heart disease: 1 = yes, 0 = no
typchd69  Type of CHD: 1 = myocardial infarction or death, 2 = silent myocardial infarction, 3 = angina pectoris
time169   Time of CHD event or end of follow-up
arcus     Arcus senilis: 0 = absent, 1 = present
bmi       Body mass index
  1. Begin by exploring the data and relationships further. What kinds of relationships do you notice with the outcome of Coronary Heart Disease (CHD)? Make sure to use appropriate tables, graphs, or summaries.

For continuous covariates, we can explore their relationship with CHD using boxplots. For categorical covariates, we can create contingency tables with the table function to explore their relationship with CHD.

# summary statistics for the continuous variables
numeric <- c("age","height","weight","sbp","dbp","chol","ncigs","bmi","time169")
knitr::kable(summary(wcgs[numeric]))

# frequency tables for the categorical variables
category <- c("behpat","dibpat","chd69","typchd69","arcus")
apply(wcgs[category], 2, table)

# boxplots of each numerical variable by CHD status
par(mfrow = c(3,3))
for (var in numeric) {
  boxplot(as.formula(paste0(var, " ~ chd69")), data = wcgs, xlab = "chd69", ylab = var)
}

#tables for categorical variable vs. CHD
for (var in category) {
  print(table(wcgs[c("chd69",var)]))
}
##      behpat
## chd69    1    2    3    4
##     0  234 1177 1155  331
##     1   30  148   61   18
##      dibpat
## chd69    0    1
##     0 1486 1411
##     1   79  178
##      chd69
## chd69    0    1
##     0 2897    0
##     1    0  257
##      typchd69
## chd69    0    1    2    3
##     0 2897    0    0    0
##     1    0  135   71   51
##      arcus
## chd69    0    1
##     0 2058  839
##     1  153  102
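
Raw counts can be hard to compare when the groups differ in size. As a small sketch, row proportions put the categories on a common scale (shown here for dibpat; the same call works for any of the categorical variables):

# proportion of men developing CHD within each dibpat group
prop.table(table(wcgs$dibpat, wcgs$chd69), margin = 1)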
  2. Create logistic regressions with the following variables:
    • age
    • bmi
    • sbp
    • dbp
    • behpat
    • ncigs
    • dibpat
    • arcus
    Create a table with these univariate regressions. Interpret them.
# turn categorical variables into factors
library(tidyverse)
wcgs <- wcgs %>%
  mutate(behpat = factor(behpat, levels = c(1, 2, 3, 4), labels = c("A1", "A2", "B3", "B4"))) %>%
  mutate(dibpat = factor(dibpat, levels = c(1, 0), labels = c("A", "B"))) %>%
  mutate(chd69 = as.factor(chd69)) %>%
  mutate(typchd69 = as.factor(typchd69)) %>%
  mutate(arcus = as.factor(arcus))

# fit a univariate logistic regression for each covariate and stack the
# exponentiated coefficients (odds ratios) into one table
library(broom)
result <- c()
for (var in c("age","bmi","sbp","dbp","behpat","ncigs","dibpat","arcus")) {
  f <- as.formula(paste0("chd69 ~ ", var))
  m <- glm(f, data = wcgs, family = binomial(link = "logit"))
  r <- tidy(m, exponentiate = TRUE, conf.int = TRUE)[-1, -c(3, 4)]  # drop intercept, std.error, statistic
  result <- rbind(result, r)
}
result
##        term  estimate      p.value  conf.low conf.high
## 2       age 1.0772619 4.557900e-11 1.0536639 1.1014329
## 21      bmi 1.0879302 4.746882e-04 1.0372496 1.1401466
## 22      sbp 1.0270301 3.727669e-13 1.0196129 1.0344164
## 23      dbp 1.0341298 2.007131e-08 1.0219629 1.0462403
## 24 behpatA2 0.9807986 9.273491e-01 0.6553361 1.5123703
## 3  behpatB3 0.4119481 1.529395e-04 0.2624012 0.6593955
## 4  behpatB4 0.4241692 5.685465e-03 0.2270284 0.7715971
## 25    ncigs 1.0234915 9.223399e-09 1.0153544 1.0315901
## 26  dibpatB 0.4214201 7.118661e-10 0.3186990 0.5525801
## 27   arcus1 1.6352801 2.483353e-04 1.2542942 2.1240618
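
The question asks for a table; a small sketch of one way to format the stacked results (using the result data frame built above) is:

knitr::kable(result, digits = 3, caption = "Univariate logistic regressions of CHD on each covariate (odds ratios)")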

The coefficient interpretation differs for continuous and categorical covariates. For a continuous covariate, its exponentiated coefficient can be interpreted in the following ways:

Take age as an example:

• If two people differ in age by one year, the older person would on average have odds of CHD 1.077 times those of the younger person.

• On average, a person's odds of CHD are 0.077 times higher than those of someone a year younger.

• On average, a person's odds of CHD are 7.726% higher than those of someone a year younger.
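
The multiplicative interpretation follows directly from the model. Writing the fitted log odds as \(\beta_0 + \beta_1 \cdot \text{age}\), a one-year increase in age multiplies the odds by \(e^{\beta_1}\):

\[
\frac{\text{odds}(\text{age}+1)}{\text{odds}(\text{age})}
= \frac{e^{\beta_0 + \beta_1(\text{age}+1)}}{e^{\beta_0 + \beta_1\,\text{age}}}
= e^{\beta_1} \approx 1.077
\]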

For a categorical covariate, its exponentiated coefficient can be interpreted in the following ways:

Take arcus as an example:

• On average, a person with arcus senilis present has odds of CHD 1.635 times those of a person without arcus senilis.

• On average, a person with arcus senilis present has odds of CHD 63.528% higher than those of a person without arcus senilis.
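
As a check (a small sketch using the counts from the chd69-by-arcus table above), the same odds ratio can be computed directly as the cross-product ratio of the 2×2 table:

# odds of CHD when arcus is present, divided by the odds when it is absent
(102 / 839) / (153 / 2058)   # ≈ 1.635, matching the exponentiated coefficient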

  3. Consider a multiple regression model with all of the above variables.
# cross-tabulate the two behavior-pattern variables to check for redundancy
table(wcgs$dibpat, wcgs$behpat)

m1 <- glm(chd69 ~ age + bmi + sbp + dbp + behpat + dibpat + ncigs + arcus, data = wcgs, family = binomial(link = "logit"))

# reordering dibpat before behpat gives an equivalent fit, with a behpat level aliased instead
# m2 <- glm(chd69 ~ age + bmi + sbp + dbp + dibpat + behpat + ncigs + arcus, data = wcgs, family = binomial(link = "logit"))

tidy(m1, exponentiate = TRUE, conf.int = TRUE)[, -c(3, 4)]
# tidy(m2, exponentiate = TRUE, conf.int = TRUE)
##    
##       A1   A2   B3   B4
##   A  264 1325    0    0
##   B    0    0 1216  349
##           term     estimate      p.value     conf.low   conf.high
## 1  (Intercept) 6.644406e-05 1.584605e-24 1.035371e-05 0.000415262
## 2          age 1.060567e+00 1.170574e-06 1.035709e+00 1.086046836
## 3          bmi 1.068028e+00 1.151964e-02 1.014586e+00 1.123732275
## 4          sbp 1.019573e+00 2.095482e-03 1.006924e+00 1.032132283
## 5          dbp 1.000659e+00 9.498950e-01 9.802821e-01 1.021435331
## 6     behpatA2 1.039891e+00 8.585570e-01 6.851928e-01 1.624144766
## 7     behpatB3 5.039283e-01 4.609561e-03 3.161822e-01 0.818601770
## 8     behpatB4 5.664717e-01 7.351574e-02 2.991780e-01 1.046538828
## 9        ncigs 1.022909e+00 6.299268e-08 1.014498e+00 1.031305046
## 10      arcus1 1.378632e+00 2.282318e-02 1.043367e+00 1.814703356
a. Comment on the changes you see in the coefficients from the univariate models to the multiple regression model.

From the output of m1, we can see that there is no coefficient for dibpat. This is because behpat and dibpat are collinear: if behpat is A1 or A2, then dibpat must be type A, and if behpat is B3 or B4, then dibpat must be type B, so behpat already contains all of the information in dibpat. Among the other covariates, dbp is no longer statistically significant in the multiple regression model. The coefficient for behpatA2 increased from 0.98 to 1.04, and the coefficient for arcus1 decreased from 1.63 to 1.38.
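
We can confirm the aliasing directly. A small sketch (using the m1 fit above): the dropped term still appears in the coefficient vector, but is reported as NA.

# the aliased dibpat term is reported as NA in the fitted coefficients
coef(m1)["dibpatB"]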

b. Pick the 3 most significant coefficients and interpret them. 

Based on the p-values in the multiple regression model, we choose age, ncigs, and sbp as the 3 most significant coefficients.

For age: on average, a person has odds of CHD 1.061 times (6.06% higher than) those of someone a year younger, holding all other covariates fixed. Similarly, each additional cigarette per day is associated with 2.29% higher odds of CHD, and each additional mm Hg of systolic blood pressure with 1.96% higher odds, holding the other covariates fixed.
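
As a small sketch (using the m1 fit above), the percent changes quoted here can be pulled straight from the fitted coefficients:

# percent change in the odds of CHD per one-unit increase in each predictor
100 * (exp(coef(m1)[c("age", "ncigs", "sbp")]) - 1)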

  4. What is the log odds of CHD for a 65 year old with bmi of 36, sbp of 136, dbp of 78, behpat A2, ncigs = 3, dibpat type B, and no arcus?

For prediction, we first create a new data frame with the covariate values of interest, then use the predict function to obtain the log odds (the default type = "link" returns the linear predictor, which is the log odds).

new_input <- data.frame(age = 65, bmi = 36, sbp = 136, dbp = 78, behpat = "A2", ncigs = 3, arcus = factor(0), dibpat = "B")
log_odds <- predict(m1, newdata = new_input)   # default type = "link" gives the log odds
log_odds
##          1 
## -0.6328796
  5. What is the probability of CHD for a 65 year old with bmi of 36, sbp of 136, dbp of 78, behpat A2, ncigs = 3, dibpat type B, and no arcus?
To convert the log odds to a probability, we apply the inverse logit: p = exp(log odds) / (1 + exp(log odds)).

p <- exp(log_odds) / (1 + exp(log_odds))
p
##         1 
## 0.3468579
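
Equivalently (a sketch using the same m1 and new_input as above), predict can return the probability directly on the response scale:

# predicting on the response scale applies the inverse logit for us
predict(m1, newdata = new_input, type = "response")   # ≈ 0.347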