
Adam J Sullivan
Assistant Professor of Biostatistics
Brown University
Name | Description |
---|---|
age | Age in years at time of Randomization |
asa | 0 - placebo, 1 - aspirin |
bmi | Body Mass Index (kg/\(m^2\)) |
hypert | 1 - Hypertensive at baseline, 0 - Not |
alcohol | 0 - less than monthly, 1 - monthly to less than daily, 2 - daily consumption |
dm | 0 = No diabetes Mellitus, 1 - diabetes Mellitus |
sbp | Systolic BP (mmHg) |
Name | Description |
---|---|
exer | 0 - No regular, 1 - Sweat at least once per week |
csmoke | 0 - Not currently, 1 - < 1 pack per day, 2 - \(\ge\) 1 pack per day |
psmoke | 0 - never smoked, 1 - former < 1 pack per day, 2 - former \(\ge\) 1 pack per day |
pkyrs | Total lifetime packs of cigarettes smoked |
crc | 0 - No colorectal Cancer, 1 - Colorectal cancer |
cayrs | Years to colorectal cancer, or death, or end of follow-up. |
We can load this data into R.
library(tidyverse)
library(haven)
phscrc <- read_dta("phscrc.dta")
phscrc <- phscrc %>%
mutate(age.cat = cut(age, c(40,50,60,70, 90), right=FALSE)) %>%
mutate(alcohol.use = factor(phscrc$alcohol>0, labels=c("no", "yes")))
mytable <- xtabs(~age.cat+alcohol.use+ crc, phscrc)
library(pander)
pandoc.table(mytable)
crc | 0 | 1 | ||
age.cat | alcohol.use | |||
[40,50) | no | 1762 | 3 | |
yes | 4764 | 36 | ||
[50,60) | no | 1385 | 19 | |
yes | 3969 | 61 | ||
[60,70) | no | 749 | 21 | |
yes | 2167 | 73 | ||
[70,90) | no | 278 | 9 | |
yes | 690 | 32 |
We could run the following model: \[CRC = \beta_0 + \beta_1 E + \beta_2 X\]
Where \(E=1\) is the group with people who drink more than monthly and \(X\) is age in years.
\[ \begin{aligned} log&\left(\dfrac{p_{e,x+1}}{1-p_{e,x+1}}\right) - \log\left(\dfrac{p_{e,x}}{1-p_{e,x}}\right)\\ &= (\beta_0 + \beta_1e+ \beta_2(x+1)) - (\beta_0 + \beta_1e+\beta_2x)\\ &= \beta_2\\ \end{aligned} \]
library(tidyverse)
mod1 <- glm(crc ~ alcohol.use+ age, phscrc, family=binomial(link="logit"))
mod2 <- glm(crc ~ alcohol.use*age, phscrc, family=binomial(link="logit"))
phscrc <- phscrc %>%
mutate(age.cent = age - mean(age,na.rm=TRUE))
mod3 <- glm(crc ~ alcohol.use*age.cent, phscrc, family=binomial(link="logit"))
term | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -8.3348992 | 0.0000000 | -9.1304585 | -7.5591081 |
alcohol.useyes | 0.3560274 | 0.0236256 | 0.0560273 | 0.6739932 |
age | 0.0697425 | 0.0000000 | 0.0575685 | 0.0819576 |
term | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -8.5800802 | 0.000000 | -10.2022552 | -7.0380756 |
alcohol.useyes | 0.6722965 | 0.460657 | -1.0882345 | 2.4909110 |
age | 0.0737894 | 0.000000 | 0.0482466 | 0.0995000 |
alcohol.useyes:age | -0.0052401 | 0.723872 | -0.0344218 | 0.0238241 |
term | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -4.6573481 | 0.000000 | -5.0165614 | -4.3396824 |
alcohol.useyes | 0.3937278 | 0.039862 | 0.0325311 | 0.7864981 |
age.cent | 0.0737894 | 0.000000 | 0.0482466 | 0.0995000 |
alcohol.useyes:age.cent | -0.0052401 | 0.723872 | -0.0344218 | 0.0238241 |
\[ \begin{aligned} LR &= 2\log\left(\dfrac{L(\text{Larger Model})}{L(\text{smaller model})}\right)\\ &= 2\log\left(L(\text{Larger Model}\right)- 2\log\left(L(\text{smaller model})\right)\\ \end{aligned} \]
\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p =0\]
null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|
2609.171 | 16017 | -1240.335 | 2486.67 | 2509.715 | 2480.67 | 16015 |
LR <- summary(mod1)$null.deviance - summary(mod1)$deviance
df <- summary(mod1)$df.null - summary(mod1)$df.residual
1- pchisq(LR,df)
## [1] 0
Many times we have categorical covariates and we need to be able to use them in model building. We can proceed two different ways depending on the data.
If we have covariate \(X\) which is categorical than we treat \(X\) differently depending on
With \(X\) which has \(K\ge2\) categories, we created \(K-1\) variables. For example:
Category | Log Odds | Probability |
---|---|---|
K | \(\beta_0\) | \(p_0 = \dfrac{\exp(\beta_0)}{1+\exp(\beta_0)}\) |
1 | \(\beta_0+\beta_1\) | \(p_0 = \dfrac{\exp(\beta_0+\beta_1)}{1+\exp(\beta_0+\beta_1)}\) |
2 | \(\beta_0+\beta_2\) | \(p_0 = \dfrac{\exp(\beta_0+\beta_2)}{1+\exp(\beta_0+\beta_2)}\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
K-1 | \(\beta_0+\beta_{k-1}\) | \(p_0 = \dfrac{\exp(\beta_0+\beta_{k-1})}{1+\exp(\beta_0+\beta_{k-1})}\) |
Alcohol
and its effect of Colorectal Cancer. term | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -4.4182131 | 0.0000000 | -4.6563690 | -4.1931173 |
alcohol | 0.2776848 | 0.0019606 | 0.1023961 | 0.4541972 |
\[ \text{Alcohol} = \begin{cases} 0 & \text{ Less than monthly}\\ 1 & \text{ Monthly to less than Daily}\\ 2 & \text{ Daily or more} \end{cases} \]
This means if
\[
\begin{aligned}
\beta_1 = 0 \Rightarrow p_0=p_1=p_2\\
\beta_1 > 0 \Rightarrow p_0
-We will find that Model 1 is nested within Model 2.
term | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -4.3853864 | 0.0000000 | -4.6716150 | -4.1235215 |
factor(alcohol)1 | 0.2173767 | 0.1928240 | -0.1035724 | 0.5522374 |
factor(alcohol)2 | 0.5436799 | 0.0024019 | 0.1961081 | 0.8999708 |
anova_mod <-anova(mod4, mod5, test="Chisq")
kable(tidy(anova_mod))
Resid..Df | Resid..Dev | df | Deviance | p.value |
---|---|---|---|---|
16016 | 2599.51 | NA | NA | NA |
16015 | 2599.33 | 1 | 0.1807433 | 0.6707353 |