The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.

The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected. This data is from Hosmer et al. , 2013.

Variable Name	Description
id	Identification Code
low	0 = Birthweight \(\ge\) 2500g
	1=Birthweight< 2500g
age	Age of mother in years
lwt	Weight in Pounds at last menstrual period
race	1 = white
	2 = black
	3 = other
ptl	History of Premature Labor (0=none, 1= One, …)
ht	History of hyptertension
ui	Presence of Uterine Irritability
	0 = No
	1 = Yes
ftv	Number of Physician visits during first trimester (0=none, 1=One, …)
btw	Birth weight in grams

Model Building

Your goal will be to build a model to predict low birth weight. Begin by using number summaries and graphs to start to explore relationships of variables in this data set and low.

For exploring the relationship between low and continuous variables, we can use boxplots. For exploring the relationship between low and categorical variables, we can: 1. create contingency tables which tabulate the counts of each category; 2. display the proportion of each level of the categorical variable by using barplot.

s1 <- apply(low.weight, 2, summary)
knitr::kable(s1)
#or we can use
summary(low.weight)
#From the summry, we can see race, smoke, ht and ui are treated as continuous variables.

	id	low	age	lwt	race	smoke	ptl	ht	ui	ftv	bwt
Min.	4.0000	0.0000000	14.0000	80.0000	1.000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	709.000
1st Qu.	68.0000	0.0000000	19.0000	110.0000	1.000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	2414.000
Median	123.0000	0.0000000	23.0000	121.0000	1.000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	2977.000
Mean	121.0794	0.3121693	23.2381	129.8148	1.846561	0.3915344	0.1957672	0.0634921	0.1481481	0.7936508	2944.656
3rd Qu.	176.0000	1.0000000	26.0000	140.0000	3.000000	1.0000000	0.0000000	0.0000000	0.0000000	1.0000000	3475.000
Max.	226.0000	1.0000000	45.0000	250.0000	3.000000	1.0000000	3.0000000	1.0000000	1.0000000	6.0000000	4990.000

##        id             low              age             lwt       
##  Min.   :  4.0   Min.   :0.0000   Min.   :14.00   Min.   : 80.0  
##  1st Qu.: 68.0   1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0  
##  Median :123.0   Median :0.0000   Median :23.00   Median :121.0  
##  Mean   :121.1   Mean   :0.3122   Mean   :23.24   Mean   :129.8  
##  3rd Qu.:176.0   3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0  
##  Max.   :226.0   Max.   :1.0000   Max.   :45.00   Max.   :250.0  
##       race           smoke             ptl               ht         
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :1.847   Mean   :0.3915   Mean   :0.1958   Mean   :0.06349  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :3.000   Max.   :1.0000   Max.   :3.0000   Max.   :1.00000  
##        ui              ftv              bwt      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :0.0000   Median :2977  
##  Mean   :0.1481   Mean   :0.7937   Mean   :2945  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3475  
##  Max.   :1.0000   Max.   :6.0000   Max.   :4990

library(ggplot2)
library(gridExtra)
numerical_var <- c("age","lwt","ptl","ht","ftv")
categorical_var <- c("low","race","ui","smoke")
low.weight$low<-as.factor(low.weight$low)

p1 <- ggplot(low.weight, aes(x = low, y = age)) + geom_boxplot()
p2 <- ggplot(low.weight, aes(x = low, y = lwt)) + geom_boxplot()
p3 <- ggplot(low.weight, aes(x = low, y = ptl)) + geom_boxplot()
p4 <- ggplot(low.weight, aes(x = low, y = ftv)) + geom_boxplot()

grid.arrange(p1,p2,p3,p4,ncol=2)

# Alternative way:
# par(mfrow=c(2,2))
# boxplot(age~low,data=low.weight,ylab="Age",xlab="Low")
# boxplot(lwt~low,data=low.weight,ylab="lwt",xlab="Low")
# boxplot(ptl~low,data=low.weight,ylab="ptl",xlab="Low")
# boxplot(ftv~low,data=low.weight,ylab="ftv",xlab="Low")

# Example: low vs race
#create barplots 
T1<-table(low.weight$low,low.weight$race)

#get proportions of normal/low weight within each race (2 represents column indicator)
par(mfrow=c(1,1))
prop.table(T1,2)
barplot(prop.table(T1,2),legend.text = c("Normal","Low"),names.arg = c("white","black","other"))

#get proportions of different races within each weight category(2 represents column indicator)
T2<-table(low.weight$race,low.weight$low)
T2
barplot(prop.table(T2,2),legend.text = c("white","black","other"),names.arg = c("Normal","Low"))

##    
##             1         2         3
##   0 0.7604167 0.5769231 0.6268657
##   1 0.2395833 0.4230769 0.3731343
##    
##      0  1
##   1 73 23
##   2 15 11
##   3 42 25

The variables of low, race and ui are categorical variables but they are not yet factors. Code them in R to be factors in the data. Then make sure they have correct level names.

library(tidyverse)
low.weight<-low.weight %>% 
             mutate(low=as.factor(low))%>%
             mutate(race=factor(race,levels=c(1,2,3),labels=c("white","black","other")))%>%
             mutate(ui=factor(ui,levels=c(0,1),labels=c("No","Yes"))) %>%
             mutate(smoke=as.factor(smoke))%>%
             mutate(ht=as.factor(ht))

#Alternative way:
# low.weight$race<-factor(low.weight$race,levels=c(1,2,3),labels=c("white","black","other"))
# low.weight$smoke<-as.factor(low.weight$smoke)

Start your model building by looking at simple logistic regressions for each of the 8 predictor variables. Display and Examine relevant plots. Summarize the simple logistic regression results using a table (hide the intercepts when combining your tidy() commands).

library(broom)
# Here, we fit 8 simple logistic regression models with a loop function.  
result <- c()
for (var in c("age","lwt","race","smoke","ptl","ht","ui","ftv","bwt")) {
  f <- as.formula(paste0("low~", var))
  m <- glm(f, data = low.weight, family = binomial(link = "logit"))
  r <- tidy(m, exponentiate = T, conf.int = T)[-1,-c(3,4)]
  result <- rbind(result,r)
}
result

# Eample: age
# fit.age<-glm(low~age,data=low.weight,family=binomial(link="logit"))
# tidy(fit.age)
# tidy(fit.age,exponentiate = T, conf.int = T)

##         term  estimate    p.value     conf.low    conf.high
## 2        age 0.9501333 0.10454800 8.912949e-01 1.009027e+00
## 21       lwt 0.9860401 0.02268856 9.733982e-01 9.973535e-01
## 22 raceblack 2.3275362 0.06830125 9.255511e-01 5.774700e+00
## 3  raceother 1.8892340 0.06740445 9.565420e-01 3.757871e+00
## 23    smoke1 2.0219436 0.02761962 1.081872e+00 3.800582e+00
## 24       ptl 2.2295634 0.01146709 1.217895e+00 4.283620e+00
## 25       ht1 3.3653846 0.04606314 1.028381e+00 1.182912e+01
## 26     uiYes 2.5777778 0.02308380 1.133202e+00 5.881263e+00
## 27       ftv 0.8736112 0.38852661 6.320298e-01 1.174074e+00
## 28       bwt 0.3054851 0.98915608 2.747029e-13 4.000659e-08

Comment of the significance of the 8 variables. What variables do you think would best be used in a multiple logistic regression?

Based on p-values, lwt, smoke, ptl, ht and ui are statistically significant at 0.05 significant level, and race is significant at 0.1 significant level.

Explore the possibility of interaction between smoking and race. Display a graph that would allow you to explore this and then run a regression with the interaction term. Interpret the results of this model.

fit1<-glm(low~smoke*race,data=low.weight,family=binomial("logit"))
prob<-predict(fit1,type="response")
interaction.plot(x.factor = low.weight$smoke,trace.factor=low.weight$race,response = prob,xlab="Smoking status",ylab="Probability of haveing low birthweight babies.")

# An alternative way:
# library(Hmisc)
#  plsmo(low.weight$smoke, low.weight$low, datadensity = T, group = low.weight$race, 
#  col=c('black', 'red','blue'), xlab = 'Smoke', ylab ='Low',
#  ylim = c(0,1))
  
tidy(fit1,conf.int = T)[,-c(3:4)]

##               term   estimate      p.value    conf.low  conf.high
## 1      (Intercept) -2.3025850 1.124386e-05 -3.50598590 -1.3945304
## 2           smoke1  1.7505164 3.429174e-03  0.66242458  3.0618216
## 3        raceblack  1.5141276 4.412016e-02  0.03896048  3.0587366
## 4        raceother  1.7429692 3.371066e-03  0.66364168  3.0487627
## 5 smoke1:raceblack -0.5565939 5.897208e-01 -2.60189233  1.4968087
## 6 smoke1:raceother -1.5273729 8.358729e-02 -3.34321787  0.1575523

From the interaction plot, we can see the probability of having low birthweight babies for people in the same race is different by their smoking status. Therefore, there is an interaction between smoking and race.

Based on the original outputs (without exponentiating the coefficients), we can interpret the interaction as:

1. For White people, the log odds ratio between smokers and non-smokers is 1.75.

2. For people whose race is black, the log odds ratio between smokers and non-smokers is 1.75-0.55=1.2.

3. For people whose race is other, the log odds ratio between smokers and non-smokers is 1.75-1.52=0.23.

4. For smokers, the log odds ratio between black and white is 1.514-0.55=0.964, the log odds ratio between other and white is 1.74-1.52=0.22.

tidy(fit1,exponentiate = T,conf.int = T)[,-c(3:4)]

##               term  estimate      p.value   conf.low  conf.high
## 1      (Intercept) 0.1000000 1.124386e-05 0.03001716  0.2479494
## 2           smoke1 5.7575752 3.429174e-03 1.93948909 21.3664428
## 3        raceblack 4.5454541 4.412016e-02 1.03972939 21.3006300
## 4        raceother 5.7142851 3.371066e-03 1.94185108 21.0892350
## 5 smoke1:raceblack 0.5731580 5.897208e-01 0.07413316  4.4674095
## 6 smoke1:raceother 0.2171053 8.358729e-02 0.03532311  1.1706420

Based on the exponentiated outputs, we can interpret the interaction as:

1. For White people, the odds ratio between smokers and non-smokers is 5.75.

2. For people whose race is black, the odds ratio between smokers and non-smokers is \(5.75\times 0.573=3.28\)

3. For people whose race is other, the odds ratio between smokers and non-smokers is \(5.75\times 0.217=1.248\)

4. For smokers, the odds ratio between black and white is \(4.545\times 0.573=2.60\), the odds ratio between other and white is \(5.71\times 0.217=1.24\)

Build a multiple regression model with what you have found in problems 4 and 5. Do the coefficients change from the simple regressions? Comment on both direction and magnitude changes.

From problem 4 and 5, we would include predictors that are statistically significant at 0.1 significant level and the interaction term between smoking and race in the multiple regression model.

fit2<-glm(low~lwt+ptl+ht+ui+race*smoke,data=low.weight,family=binomial)
tidy(fit2,exponentiate = T)

##                term  estimate   std.error  statistic    p.value
## 1       (Intercept) 0.6011423 1.060793206 -0.4797576 0.63139973
## 2               lwt 0.9848478 0.007019635 -2.1750671 0.02962509
## 3               ptl 1.6863028 0.345028552  1.5144788 0.12990444
## 4               ht1 6.0111615 0.707328092  2.5357652 0.01122019
## 5             uiYes 2.3648168 0.460713659  1.8681898 0.06173562
## 6         raceblack 4.4068072 0.781746285  1.8972274 0.05779794
## 7         raceother 4.0292163 0.622143605  2.2399521 0.02509403
## 8            smoke1 4.0049510 0.625234317  2.2192181 0.02647189
## 9  raceblack:smoke1 0.9699162 1.080888164 -0.0282597 0.97745503
## 10 raceother:smoke1 0.2648653 0.936452450 -1.4186880 0.15598999