The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.

The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected. This data is from Hosmer et al. , 2013.

Variable Name Description
id Identification Code
low 0 = Birthweight \(\ge\) 2500g
1=Birthweight< 2500g
age Age of mother in years
lwt Weight in Pounds at last menstrual period
race 1 = white
2 = black
3 = other
ptl History of Premature Labor (0=none, 1= One, …)
ht History of hyptertension
ui Presence of Uterine Irritability
0 = No
1 = Yes
ftv Number of Physician visits during first trimester (0=none, 1=One, …)
btw Birth weight in grams

You can read the data in with the command below.

low.weight <- read.table("https://drive.google.com/uc?export=download&id=0B8CsRLdwqzbzMzJyVkt5QkdvVnM", header=TRUE, sep=",")

Model Building

  1. Your goal will be to build a model to predict low birth weight. Begin by using number summaries and graphs to start to explore relationships of variables in this data set and low.

For exploring the relationship between low and continuous variables, we can use boxplots. For exploring the relationship between low and categorical variables, we can: 1. create contingency tables which tabulate the counts of each category; 2. display the proportion of each level of the categorical variable by using barplot.

s1 <- apply(low.weight, 2, summary)
knitr::kable(s1)
#or we can use
summary(low.weight)
#From the summry, we can see race, smoke, ht and ui are treated as continuous variables. 
id low age lwt race smoke ptl ht ui ftv bwt
Min. 4.0000 0.0000000 14.0000 80.0000 1.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 709.000
1st Qu. 68.0000 0.0000000 19.0000 110.0000 1.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 2414.000
Median 123.0000 0.0000000 23.0000 121.0000 1.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 2977.000
Mean 121.0794 0.3121693 23.2381 129.8148 1.846561 0.3915344 0.1957672 0.0634921 0.1481481 0.7936508 2944.656
3rd Qu. 176.0000 1.0000000 26.0000 140.0000 3.000000 1.0000000 0.0000000 0.0000000 0.0000000 1.0000000 3475.000
Max. 226.0000 1.0000000 45.0000 250.0000 3.000000 1.0000000 3.0000000 1.0000000 1.0000000 6.0000000 4990.000
##        id             low              age             lwt       
##  Min.   :  4.0   Min.   :0.0000   Min.   :14.00   Min.   : 80.0  
##  1st Qu.: 68.0   1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0  
##  Median :123.0   Median :0.0000   Median :23.00   Median :121.0  
##  Mean   :121.1   Mean   :0.3122   Mean   :23.24   Mean   :129.8  
##  3rd Qu.:176.0   3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0  
##  Max.   :226.0   Max.   :1.0000   Max.   :45.00   Max.   :250.0  
##       race           smoke             ptl               ht         
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :1.847   Mean   :0.3915   Mean   :0.1958   Mean   :0.06349  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :3.000   Max.   :1.0000   Max.   :3.0000   Max.   :1.00000  
##        ui              ftv              bwt      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :0.0000   Median :2977  
##  Mean   :0.1481   Mean   :0.7937   Mean   :2945  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3475  
##  Max.   :1.0000   Max.   :6.0000   Max.   :4990
library(ggplot2)
library(gridExtra)
numerical_var <- c("age","lwt","ptl","ht","ftv")
categorical_var <- c("low","race","ui","smoke")
low.weight$low<-as.factor(low.weight$low)

p1 <- ggplot(low.weight, aes(x = low, y = age)) + geom_boxplot()
p2 <- ggplot(low.weight, aes(x = low, y = lwt)) + geom_boxplot()
p3 <- ggplot(low.weight, aes(x = low, y = ptl)) + geom_boxplot()
p4 <- ggplot(low.weight, aes(x = low, y = ftv)) + geom_boxplot()

grid.arrange(p1,p2,p3,p4,ncol=2)

# Alternative way:
# par(mfrow=c(2,2))
# boxplot(age~low,data=low.weight,ylab="Age",xlab="Low")
# boxplot(lwt~low,data=low.weight,ylab="lwt",xlab="Low")
# boxplot(ptl~low,data=low.weight,ylab="ptl",xlab="Low")
# boxplot(ftv~low,data=low.weight,ylab="ftv",xlab="Low")

# Example: low vs race
#create barplots 
T1<-table(low.weight$low,low.weight$race)

#get proportions of normal/low weight within each race (2 represents column indicator)
par(mfrow=c(1,1))
prop.table(T1,2)
barplot(prop.table(T1,2),legend.text = c("Normal","Low"),names.arg = c("white","black","other"))

#get proportions of different races within each weight category(2 represents column indicator)
T2<-table(low.weight$race,low.weight$low)
T2
barplot(prop.table(T2,2),legend.text = c("white","black","other"),names.arg = c("Normal","Low"))

##    
##             1         2         3
##   0 0.7604167 0.5769231 0.6268657
##   1 0.2395833 0.4230769 0.3731343
##    
##      0  1
##   1 73 23
##   2 15 11
##   3 42 25
  1. The variables of low, race and ui are categorical variables but they are not yet factors. Code them in R to be factors in the data. Then make sure they have correct level names.
library(tidyverse)
low.weight<-low.weight %>% 
             mutate(low=as.factor(low))%>%
             mutate(race=factor(race,levels=c(1,2,3),labels=c("white","black","other")))%>%
             mutate(ui=factor(ui,levels=c(0,1),labels=c("No","Yes"))) %>%
             mutate(smoke=as.factor(smoke))%>%
             mutate(ht=as.factor(ht))

#Alternative way:
# low.weight$race<-factor(low.weight$race,levels=c(1,2,3),labels=c("white","black","other"))
# low.weight$smoke<-as.factor(low.weight$smoke)
  1. Start your model building by looking at simple logistic regressions for each of the 8 predictor variables. Display and Examine relevant plots. Summarize the simple logistic regression results using a table (hide the intercepts when combining your tidy() commands).
library(broom)
# Here, we fit 8 simple logistic regression models with a loop function.  
result <- c()
for (var in c("age","lwt","race","smoke","ptl","ht","ui","ftv","bwt")) {
  f <- as.formula(paste0("low~", var))
  m <- glm(f, data = low.weight, family = binomial(link = "logit"))
  r <- tidy(m, exponentiate = T, conf.int = T)[-1,-c(3,4)]
  result <- rbind(result,r)
}
result

# Eample: age
# fit.age<-glm(low~age,data=low.weight,family=binomial(link="logit"))
# tidy(fit.age)
# tidy(fit.age,exponentiate = T, conf.int = T)
##         term  estimate    p.value     conf.low    conf.high
## 2        age 0.9501333 0.10454800 8.912949e-01 1.009027e+00
## 21       lwt 0.9860401 0.02268856 9.733982e-01 9.973535e-01
## 22 raceblack 2.3275362 0.06830125 9.255511e-01 5.774700e+00
## 3  raceother 1.8892340 0.06740445 9.565420e-01 3.757871e+00
## 23    smoke1 2.0219436 0.02761962 1.081872e+00 3.800582e+00
## 24       ptl 2.2295634 0.01146709 1.217895e+00 4.283620e+00
## 25       ht1 3.3653846 0.04606314 1.028381e+00 1.182912e+01
## 26     uiYes 2.5777778 0.02308380 1.133202e+00 5.881263e+00
## 27       ftv 0.8736112 0.38852661 6.320298e-01 1.174074e+00
## 28       bwt 0.3054851 0.98915608 2.747029e-13 4.000659e-08
  1. Comment of the significance of the 8 variables. What variables do you think would best be used in a multiple logistic regression?

Based on p-values, lwt, smoke, ptl, ht and ui are statistically significant at 0.05 significant level, and race is significant at 0.1 significant level.

  1. Explore the possibility of interaction between smoking and race. Display a graph that would allow you to explore this and then run a regression with the interaction term. Interpret the results of this model.
fit1<-glm(low~smoke*race,data=low.weight,family=binomial("logit"))
prob<-predict(fit1,type="response")
interaction.plot(x.factor = low.weight$smoke,trace.factor=low.weight$race,response = prob,xlab="Smoking status",ylab="Probability of haveing low birthweight babies.")

# An alternative way:
# library(Hmisc)
#  plsmo(low.weight$smoke, low.weight$low, datadensity = T, group = low.weight$race, 
#  col=c('black', 'red','blue'), xlab = 'Smoke', ylab ='Low',
#  ylim = c(0,1))
  
tidy(fit1,conf.int = T)[,-c(3:4)]
##               term   estimate      p.value    conf.low  conf.high
## 1      (Intercept) -2.3025850 1.124386e-05 -3.50598590 -1.3945304
## 2           smoke1  1.7505164 3.429174e-03  0.66242458  3.0618216
## 3        raceblack  1.5141276 4.412016e-02  0.03896048  3.0587366
## 4        raceother  1.7429692 3.371066e-03  0.66364168  3.0487627
## 5 smoke1:raceblack -0.5565939 5.897208e-01 -2.60189233  1.4968087
## 6 smoke1:raceother -1.5273729 8.358729e-02 -3.34321787  0.1575523

From the interaction plot, we can see the probability of having low birthweight babies for people in the same race is different by their smoking status. Therefore, there is an interaction between smoking and race.

Based on the original outputs (without exponentiating the coefficients), we can interpret the interaction as:

1. For White people, the log odds ratio between smokers and non-smokers is 1.75.

2. For people whose race is black, the log odds ratio between smokers and non-smokers is 1.75-0.55=1.2.

3. For people whose race is other, the log odds ratio between smokers and non-smokers is 1.75-1.52=0.23.

4. For smokers, the log odds ratio between black and white is 1.514-0.55=0.964, the log odds ratio between other and white is 1.74-1.52=0.22.

tidy(fit1,exponentiate = T,conf.int = T)[,-c(3:4)]
##               term  estimate      p.value   conf.low  conf.high
## 1      (Intercept) 0.1000000 1.124386e-05 0.03001716  0.2479494
## 2           smoke1 5.7575752 3.429174e-03 1.93948909 21.3664428
## 3        raceblack 4.5454541 4.412016e-02 1.03972939 21.3006300
## 4        raceother 5.7142851 3.371066e-03 1.94185108 21.0892350
## 5 smoke1:raceblack 0.5731580 5.897208e-01 0.07413316  4.4674095
## 6 smoke1:raceother 0.2171053 8.358729e-02 0.03532311  1.1706420

Based on the exponentiated outputs, we can interpret the interaction as:

1. For White people, the odds ratio between smokers and non-smokers is 5.75.

2. For people whose race is black, the odds ratio between smokers and non-smokers is \(5.75\times 0.573=3.28\)

3. For people whose race is other, the odds ratio between smokers and non-smokers is \(5.75\times 0.217=1.248\)

4. For smokers, the odds ratio between black and white is \(4.545\times 0.573=2.60\), the odds ratio between other and white is \(5.71\times 0.217=1.24\)

  1. Build a multiple regression model with what you have found in problems 4 and 5. Do the coefficients change from the simple regressions? Comment on both direction and magnitude changes.

From problem 4 and 5, we would include predictors that are statistically significant at 0.1 significant level and the interaction term between smoking and race in the multiple regression model.

fit2<-glm(low~lwt+ptl+ht+ui+race*smoke,data=low.weight,family=binomial)
tidy(fit2,exponentiate = T)
##                term  estimate   std.error  statistic    p.value
## 1       (Intercept) 0.6011423 1.060793206 -0.4797576 0.63139973
## 2               lwt 0.9848478 0.007019635 -2.1750671 0.02962509
## 3               ptl 1.6863028 0.345028552  1.5144788 0.12990444
## 4               ht1 6.0111615 0.707328092  2.5357652 0.01122019
## 5             uiYes 2.3648168 0.460713659  1.8681898 0.06173562
## 6         raceblack 4.4068072 0.781746285  1.8972274 0.05779794
## 7         raceother 4.0292163 0.622143605  2.2399521 0.02509403
## 8            smoke1 4.0049510 0.625234317  2.2192181 0.02647189
## 9  raceblack:smoke1 0.9699162 1.080888164 -0.0282597 0.97745503
## 10 raceother:smoke1 0.2648653 0.936452450 -1.4186880 0.15598999