The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected. This data is from Hosmer et al. , 2013.
Variable Name | Description |
---|---|
id | Identification Code |
low | 0 = Birthweight \(\ge\) 2500g |
1=Birthweight< 2500g | |
age | Age of mother in years |
lwt | Weight in Pounds at last menstrual period |
race | 1 = white |
2 = black | |
3 = other | |
ptl | History of Premature Labor (0=none, 1= One, …) |
ht | History of hyptertension |
ui | Presence of Uterine Irritability |
0 = No | |
1 = Yes | |
ftv | Number of Physician visits during first trimester (0=none, 1=One, …) |
btw | Birth weight in grams |
You can read the data in with the command below.
low.weight <- read.table("https://drive.google.com/uc?export=download&id=0B8CsRLdwqzbzMzJyVkt5QkdvVnM", header=TRUE, sep=",")
low
.For exploring the relationship between low
and continuous variables, we can use boxplots. For exploring the relationship between low and categorical variables, we can: 1. create contingency tables which tabulate the counts of each category; 2. display the proportion of each level of the categorical variable by using barplot.
s1 <- apply(low.weight, 2, summary)
knitr::kable(s1)
#or we can use
summary(low.weight)
#From the summry, we can see race, smoke, ht and ui are treated as continuous variables.
id | low | age | lwt | race | smoke | ptl | ht | ui | ftv | bwt | |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. | 4.0000 | 0.0000000 | 14.0000 | 80.0000 | 1.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 709.000 |
1st Qu. | 68.0000 | 0.0000000 | 19.0000 | 110.0000 | 1.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 2414.000 |
Median | 123.0000 | 0.0000000 | 23.0000 | 121.0000 | 1.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 2977.000 |
Mean | 121.0794 | 0.3121693 | 23.2381 | 129.8148 | 1.846561 | 0.3915344 | 0.1957672 | 0.0634921 | 0.1481481 | 0.7936508 | 2944.656 |
3rd Qu. | 176.0000 | 1.0000000 | 26.0000 | 140.0000 | 3.000000 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 | 3475.000 |
Max. | 226.0000 | 1.0000000 | 45.0000 | 250.0000 | 3.000000 | 1.0000000 | 3.0000000 | 1.0000000 | 1.0000000 | 6.0000000 | 4990.000 |
## id low age lwt
## Min. : 4.0 Min. :0.0000 Min. :14.00 Min. : 80.0
## 1st Qu.: 68.0 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0
## Median :123.0 Median :0.0000 Median :23.00 Median :121.0
## Mean :121.1 Mean :0.3122 Mean :23.24 Mean :129.8
## 3rd Qu.:176.0 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :226.0 Max. :1.0000 Max. :45.00 Max. :250.0
## race smoke ptl ht
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :1.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :1.847 Mean :0.3915 Mean :0.1958 Mean :0.06349
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :3.000 Max. :1.0000 Max. :3.0000 Max. :1.00000
## ui ftv bwt
## Min. :0.0000 Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :0.0000 Median :2977
## Mean :0.1481 Mean :0.7937 Mean :2945
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3475
## Max. :1.0000 Max. :6.0000 Max. :4990
library(ggplot2)
library(gridExtra)
numerical_var <- c("age","lwt","ptl","ht","ftv")
categorical_var <- c("low","race","ui","smoke")
low.weight$low<-as.factor(low.weight$low)
p1 <- ggplot(low.weight, aes(x = low, y = age)) + geom_boxplot()
p2 <- ggplot(low.weight, aes(x = low, y = lwt)) + geom_boxplot()
p3 <- ggplot(low.weight, aes(x = low, y = ptl)) + geom_boxplot()
p4 <- ggplot(low.weight, aes(x = low, y = ftv)) + geom_boxplot()
grid.arrange(p1,p2,p3,p4,ncol=2)
# Alternative way:
# par(mfrow=c(2,2))
# boxplot(age~low,data=low.weight,ylab="Age",xlab="Low")
# boxplot(lwt~low,data=low.weight,ylab="lwt",xlab="Low")
# boxplot(ptl~low,data=low.weight,ylab="ptl",xlab="Low")
# boxplot(ftv~low,data=low.weight,ylab="ftv",xlab="Low")
# Example: low vs race
#create barplots
T1<-table(low.weight$low,low.weight$race)
#get proportions of normal/low weight within each race (2 represents column indicator)
par(mfrow=c(1,1))
prop.table(T1,2)
barplot(prop.table(T1,2),legend.text = c("Normal","Low"),names.arg = c("white","black","other"))
#get proportions of different races within each weight category(2 represents column indicator)
T2<-table(low.weight$race,low.weight$low)
T2
barplot(prop.table(T2,2),legend.text = c("white","black","other"),names.arg = c("Normal","Low"))
##
## 1 2 3
## 0 0.7604167 0.5769231 0.6268657
## 1 0.2395833 0.4230769 0.3731343
##
## 0 1
## 1 73 23
## 2 15 11
## 3 42 25
low
, race
and ui
are categorical variables but they are not yet factors. Code them in R to be factors in the data. Then make sure they have correct level names.library(tidyverse)
low.weight<-low.weight %>%
mutate(low=as.factor(low))%>%
mutate(race=factor(race,levels=c(1,2,3),labels=c("white","black","other")))%>%
mutate(ui=factor(ui,levels=c(0,1),labels=c("No","Yes"))) %>%
mutate(smoke=as.factor(smoke))%>%
mutate(ht=as.factor(ht))
#Alternative way:
# low.weight$race<-factor(low.weight$race,levels=c(1,2,3),labels=c("white","black","other"))
# low.weight$smoke<-as.factor(low.weight$smoke)
tidy()
commands).library(broom)
# Here, we fit 8 simple logistic regression models with a loop function.
result <- c()
for (var in c("age","lwt","race","smoke","ptl","ht","ui","ftv","bwt")) {
f <- as.formula(paste0("low~", var))
m <- glm(f, data = low.weight, family = binomial(link = "logit"))
r <- tidy(m, exponentiate = T, conf.int = T)[-1,-c(3,4)]
result <- rbind(result,r)
}
result
# Eample: age
# fit.age<-glm(low~age,data=low.weight,family=binomial(link="logit"))
# tidy(fit.age)
# tidy(fit.age,exponentiate = T, conf.int = T)
## term estimate p.value conf.low conf.high
## 2 age 0.9501333 0.10454800 8.912949e-01 1.009027e+00
## 21 lwt 0.9860401 0.02268856 9.733982e-01 9.973535e-01
## 22 raceblack 2.3275362 0.06830125 9.255511e-01 5.774700e+00
## 3 raceother 1.8892340 0.06740445 9.565420e-01 3.757871e+00
## 23 smoke1 2.0219436 0.02761962 1.081872e+00 3.800582e+00
## 24 ptl 2.2295634 0.01146709 1.217895e+00 4.283620e+00
## 25 ht1 3.3653846 0.04606314 1.028381e+00 1.182912e+01
## 26 uiYes 2.5777778 0.02308380 1.133202e+00 5.881263e+00
## 27 ftv 0.8736112 0.38852661 6.320298e-01 1.174074e+00
## 28 bwt 0.3054851 0.98915608 2.747029e-13 4.000659e-08
Based on p-values, lwt
, smoke
, ptl
, ht
and ui
are statistically significant at 0.05 significant level, and race
is significant at 0.1 significant level.
fit1<-glm(low~smoke*race,data=low.weight,family=binomial("logit"))
prob<-predict(fit1,type="response")
interaction.plot(x.factor = low.weight$smoke,trace.factor=low.weight$race,response = prob,xlab="Smoking status",ylab="Probability of haveing low birthweight babies.")
# An alternative way:
# library(Hmisc)
# plsmo(low.weight$smoke, low.weight$low, datadensity = T, group = low.weight$race,
# col=c('black', 'red','blue'), xlab = 'Smoke', ylab ='Low',
# ylim = c(0,1))
tidy(fit1,conf.int = T)[,-c(3:4)]
## term estimate p.value conf.low conf.high
## 1 (Intercept) -2.3025850 1.124386e-05 -3.50598590 -1.3945304
## 2 smoke1 1.7505164 3.429174e-03 0.66242458 3.0618216
## 3 raceblack 1.5141276 4.412016e-02 0.03896048 3.0587366
## 4 raceother 1.7429692 3.371066e-03 0.66364168 3.0487627
## 5 smoke1:raceblack -0.5565939 5.897208e-01 -2.60189233 1.4968087
## 6 smoke1:raceother -1.5273729 8.358729e-02 -3.34321787 0.1575523
From the interaction plot, we can see the probability of having low birthweight babies for people in the same race is different by their smoking status. Therefore, there is an interaction between smoking and race.
Based on the original outputs (without exponentiating the coefficients), we can interpret the interaction as:
1. For White people, the log odds ratio between smokers and non-smokers is 1.75.
2. For people whose race is black, the log odds ratio between smokers and non-smokers is 1.75-0.55=1.2.
3. For people whose race is other, the log odds ratio between smokers and non-smokers is 1.75-1.52=0.23.
4. For smokers, the log odds ratio between black and white is 1.514-0.55=0.964, the log odds ratio between other and white is 1.74-1.52=0.22.
tidy(fit1,exponentiate = T,conf.int = T)[,-c(3:4)]
## term estimate p.value conf.low conf.high
## 1 (Intercept) 0.1000000 1.124386e-05 0.03001716 0.2479494
## 2 smoke1 5.7575752 3.429174e-03 1.93948909 21.3664428
## 3 raceblack 4.5454541 4.412016e-02 1.03972939 21.3006300
## 4 raceother 5.7142851 3.371066e-03 1.94185108 21.0892350
## 5 smoke1:raceblack 0.5731580 5.897208e-01 0.07413316 4.4674095
## 6 smoke1:raceother 0.2171053 8.358729e-02 0.03532311 1.1706420
Based on the exponentiated outputs, we can interpret the interaction as:
1. For White people, the odds ratio between smokers and non-smokers is 5.75.
2. For people whose race is black, the odds ratio between smokers and non-smokers is \(5.75\times 0.573=3.28\)
3. For people whose race is other, the odds ratio between smokers and non-smokers is \(5.75\times 0.217=1.248\)
4. For smokers, the odds ratio between black and white is \(4.545\times 0.573=2.60\), the odds ratio between other and white is \(5.71\times 0.217=1.24\)
From problem 4 and 5, we would include predictors that are statistically significant at 0.1 significant level and the interaction term between smoking and race in the multiple regression model.
fit2<-glm(low~lwt+ptl+ht+ui+race*smoke,data=low.weight,family=binomial)
tidy(fit2,exponentiate = T)
## term estimate std.error statistic p.value
## 1 (Intercept) 0.6011423 1.060793206 -0.4797576 0.63139973
## 2 lwt 0.9848478 0.007019635 -2.1750671 0.02962509
## 3 ptl 1.6863028 0.345028552 1.5144788 0.12990444
## 4 ht1 6.0111615 0.707328092 2.5357652 0.01122019
## 5 uiYes 2.3648168 0.460713659 1.8681898 0.06173562
## 6 raceblack 4.4068072 0.781746285 1.8972274 0.05779794
## 7 raceother 4.0292163 0.622143605 2.2399521 0.02509403
## 8 smoke1 4.0049510 0.625234317 2.2192181 0.02647189
## 9 raceblack:smoke1 0.9699162 1.080888164 -0.0282597 0.97745503
## 10 raceother:smoke1 0.2648653 0.936452450 -1.4186880 0.15598999