This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 2.
You will notice that this is the same data set as last lab:
poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt",
header=FALSE, sep="")
colnames(poverty) <- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")
#This will print some of the beginning rows of data
head(poverty)
## birthrate deathrate infdeath le.male le.female gnp group
## 1 24.7 5.7 30.8 69.6 75.5 600 1
## 2 12.5 11.9 14.4 68.3 74.7 2250 1
## 3 13.4 11.7 11.3 71.8 77.7 2980 1
## 4 12.0 12.4 7.6 69.8 75.9 * 1
## 5 11.6 13.4 14.8 65.4 73.8 2780 1
## 6 14.3 10.2 16.0 67.2 75.7 1690 1
## country
## 1 Albania
## 2 Bulgaria
## 3 Czechoslovakia
## 4 Former_E._Germany
## 5 Hungary
## 6 Poland
#R thinks gnp is factor(group) data
#class(poverty$gnp)
#Replace the * values with NA
poverty[poverty=="*"] <- NA
# Tell R that gnp is numeric data
poverty$gnp <- as.numeric(as.character(poverty$gnp))
#log.gnp
poverty$log.gnp <- log(poverty$gnp)
We should always begin with a data exploration. In order to make this proceed faster in lab, below is an embedded graph that will allow you to explore the relationships of all the variables in the lab.
To get the shiny below to run, replace the header at the top with this:
---
title: "Lab 2 - Multiple Regression and Model Testing"
author: "Your Name"
output:
html_document
runtime: shiny
---
then run the document and your code below will allow you to explore the data more.
library(shiny)
library(ggplot2)
data <- poverty
shinyApp(
# Define UI for iris application
shinyUI(pageWithSidebar(
# Application title
headerPanel(""),
# Sidebar with controls to select the variable to plot against mpg
# and to specify whether outliers should be included
sidebarPanel(
selectInput("variable", "First variable:",
list(
"birthrate", "deathrate", "infdeath", "le.male", "le.female", "gnp", "log.gnp"
)),
selectInput("variable2", "Second variable:",
list(
"birthrate", "deathrate", "infdeath", "le.male", "le.female", "gnp", "log.gnp"
))
),
mainPanel(
h3(textOutput("caption")),
plotOutput("plot")
)
)),
# Define server logic required to plot variables
shinyServer(function(input, output) {
# Create a reactive text
text <- reactive({
paste(input$variable, 'versus', input$variable2)
})
formulaText <- reactive({
paste(input$variable, "~", input$variable2)
})
# Return as text the selected variables
output$caption <- renderText({
text()
})
# Generate a plot of the requested variables
output$plot <- renderPlot({
p <- ggplot(data, aes_string(x=input$variable, y=input$variable2)) + geom_point()
print(p)
})
}),
options = list(height = 600)
)
Recall the following models
fit1 <- lm(infdeath ~ le.female, data=poverty)
fit2 <- lm(infdeath ~ le.male, data=poverty)
fit3 <- lm(infdeath ~ gnp, data=poverty)
fit4 <- lm(infdeath ~ birthrate, data=poverty)
fit5 <- lm(infdeath ~ deathrate, data=poverty)
fit6 <- lm(infdeath~log.gnp, data=poverty)
mfit <- lm(infdeath ~ le.female + birthrate + log.gnp, data=poverty)
Last time we finished off with mfit
. We will need to consider the model fit on this now. It will be our goal to use various plots to check the fit of our model and then we will use automated model procedures to see if this model is what R would automatically build.
library(olsrr)
ols_plot_resid_fit(mfit)
ols_test_score(mfit)
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.967767
## Prob > Chi2 = 0.04637925
ols_test_f(mfit)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## --------------------------
## Num DF = 1
## Den DF = 89
## F = 4.057476
## Prob > F = 0.04699519
a.The model is a correct fit.
b.From the residual plot and the score test, the homogenous variance assumption is not held.
mfit
. Does this fit our data well?library(car)
#blue line represents a loess(smoothing) line for the data and the dashed line represents
#the model which R fitted
mmps(mfit)
Yes, this model fit the data well.
ols_plot_cooksd_bar(mfit)
ols_plot_dfbetas(mfit)
ols_plot_dffits(mfit)
There are some outliers in this dataset when fitting the model.
mfit.ex.outliers
which excludes outliers. The code to remove the outliers is as follows:outliers <- c(18, 41, 48, 55, 76, 79, 90)
poverty2 <- poverty[-outliers,]
mfit.ex.outliers<-lm(infdeath ~ le.female + birthrate + log.gnp, data = poverty2)
library(broom)
tidy.mfit<-tidy(mfit)
tidy.mfit.ex.outlier<-tidy(mfit.ex.outliers)
rbind(tidy.mfit,tidy.mfit.ex.outlier)
## term estimate std.error statistic p.value
## 1 (Intercept) 312.20423719 26.3243755 11.85989151 7.242441e-20
## 2 le.female -3.92846115 0.3556308 -11.04645922 3.054395e-18
## 3 birthrate 0.05135414 0.2422885 0.21195449 8.326383e-01
## 4 log.gnp 0.12803497 1.5903078 0.08050955 9.360169e-01
## 5 (Intercept) 311.21743408 27.6927667 11.23822106 3.945707e-18
## 6 le.female -3.94982592 0.3760349 -10.50388106 1.015906e-16
## 7 birthrate 0.05908817 0.2534429 0.23314195 8.162467e-01
## 8 log.gnp 0.39238486 1.6805894 0.23348050 8.159847e-01
glance(mfit)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.9105986 0.9075158 14.0811 295.3796 1.733962e-45 4 -367.758
## AIC BIC deviance df.residual
## 1 745.5159 758.0702 17250.13 87
glance(mfit.ex.outliers)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.9058432 0.9023123 14.36912 256.5488 6.175445e-41 4 -341.0085
## AIC BIC deviance df.residual
## 1 692.017 704.171 16517.72 80
a. Does the summary of `mfit.ex.outliers` differ much from `mift`?
The summary of mfit.ex.outliers
does not differ much from mift
.
b. What happens when you run the cooks distance plot on this data with outliers removed?
ols_plot_cooksd_bar(mfit.ex.outliers)
There are still outliers in the model.
c. Use residual plots and `mmps()` to evaluate the model fit after deleting the outliers.
ols_plot_resid_fit(mfit.ex.outliers)
ols_test_score(mfit.ex.outliers)
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.254808
## Prob > Chi2 = 0.03913938
ols_test_f(mfit.ex.outliers)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## --------------------------
## Num DF = 1
## Den DF = 82
## F = 4.375113
## Prob > F = 0.03956154
mmps(mfit.ex.outliers)
The model still doesn’t meet the homogenous variance assumption. The fitting plots shows that the mfit.ex.outliers
does not fit the data better than mfit
.
d. Does this new model fit the data better with the outliers removed?
Above shows us that there is more involved than just cook’s distance and removing outliers. We can get stuck in an endless loop of removing things and each time we find that our model fit decreases and we have more and more outliers. We will proceed with the model mfit
since our regression results did not change much and our fit got worse as we went.
poverty3<-na.omit(poverty)
fit.best.2 <- lm(infdeath ~ le.female + le.male, data=poverty3)
fit.best.3 <- lm(infdeath ~ le.female + le.male + birthrate, data=poverty3)
fit.best.4 <- lm(infdeath ~ le.female + le.male + birthrate + log.gnp, data=poverty3)
glance(fit.best.2)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.9106334 0.9086023 13.99814 448.3538 7.107487e-47 3 -367.7403
## AIC BIC deviance df.residual
## 1 743.4805 753.524 17243.42 88
glance(fit.best.3)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.9106638 0.9075832 14.07597 295.6164 1.679852e-45 4 -367.7248
## AIC BIC deviance df.residual
## 1 745.4495 758.0038 17237.55 87
glance(fit.best.4)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.9106706 0.9065157 14.15703 219.1822 3.137215e-44 5 -367.7213
## AIC BIC deviance df.residual
## 1 747.4426 762.5078 17236.24 86
fit.best.2
is the best since it have the largest adjusted R square.
anova()
function in R to preform an \(F\)-test comparing these models. Which model would you choose based on this? Does it confirm your model chosen in question 9?anova(fit.best.2,fit.best.3,fit.best.4)
## Analysis of Variance Table
##
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate
## Model 3: infdeath ~ le.female + le.male + birthrate + log.gnp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 88 17243
## 2 87 17238 1 5.8676 0.0293 0.8645
## 3 86 17236 1 1.3125 0.0065 0.9357
anova(fit.best.2,fit.best.3)
## Analysis of Variance Table
##
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 88 17243
## 2 87 17238 1 5.8676 0.0296 0.8638
anova(fit.best.3,fit.best.4)
## Analysis of Variance Table
##
## Model 1: infdeath ~ le.female + le.male + birthrate
## Model 2: infdeath ~ le.female + le.male + birthrate + log.gnp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 87 17238
## 2 86 17236 1 1.3125 0.0065 0.9357
anova(fit.best.2,fit.best.4)
## Analysis of Variance Table
##
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate + log.gnp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 88 17243
## 2 86 17236 2 7.18 0.0179 0.9823
The anova test shows that the fit.best.3
and fit.best.4
are not significantly better than fit.best.2
. Thus, fit.best.2
is the best model, it confirms the model chosen in question 9.
ols_plot_resid_fit(fit.best.2)
ols_test_score(fit.best.2)
##
## Score Test for Heteroskedasticity
## ---------------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.515315
## Prob > Chi2 = 0.06080409
ols_test_f(fit.best.2)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of infdeath
##
## Test Summary
## --------------------------
## Num DF = 1
## Den DF = 89
## F = 3.576204
## Prob > F = 0.06186585
mmps(fit.best.2)
The final model hold the homgenous variance assumption. And it fits the data well.