Introduction

This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 2.

You will notice that this is the same data set as last lab:

poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt",
                 header=FALSE, sep="")

colnames(poverty) <- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")

#This will print some of the beginning rows of data
head(poverty)

##   birthrate deathrate infdeath le.male le.female  gnp group
## 1      24.7       5.7     30.8    69.6      75.5  600     1
## 2      12.5      11.9     14.4    68.3      74.7 2250     1
## 3      13.4      11.7     11.3    71.8      77.7 2980     1
## 4      12.0      12.4      7.6    69.8      75.9    *     1
## 5      11.6      13.4     14.8    65.4      73.8 2780     1
## 6      14.3      10.2     16.0    67.2      75.7 1690     1
##             country
## 1           Albania
## 2          Bulgaria
## 3    Czechoslovakia
## 4 Former_E._Germany
## 5           Hungary
## 6            Poland

#R thinks gnp is factor(group) data
#class(poverty$gnp)

#Replace the * values with NA
poverty[poverty=="*"] <- NA

# Tell R that gnp is numeric data 
poverty$gnp <- as.numeric(as.character(poverty$gnp))

#log.gnp
poverty$log.gnp <- log(poverty$gnp)

Data Exploration

We should always begin with a data exploration. In order to make this proceed faster in lab, below is an embedded graph that will allow you to explore the relationships of all the variables in the lab.

To get the shiny below to run, replace the header at the top with this:

---
title: "Lab 2 - Multiple Regression and Model Testing"
author: "Your Name"
output: 
  html_document
runtime: shiny
---

then run the document and your code below will allow you to explore the data more.

library(shiny)
library(ggplot2)
data <- poverty
shinyApp(
  




 
# Define UI for iris application
shinyUI(pageWithSidebar(
 
  # Application title
  headerPanel(""),
 
  # Sidebar with controls to select the variable to plot against mpg
  # and to specify whether outliers should be included
  sidebarPanel(
    selectInput("variable", "First variable:",
                list(
"birthrate", "deathrate", "infdeath",  "le.male",   "le.female", "gnp",    "log.gnp"
                                          )),
 
    selectInput("variable2", "Second variable:",
                list(
"birthrate", "deathrate", "infdeath",  "le.male",   "le.female", "gnp",    "log.gnp"
                                          ))
  ),


  mainPanel(
     h3(textOutput("caption")),
     plotOutput("plot")
  )
)),


 

 
# Define server logic required to plot variables
shinyServer(function(input, output) {
 
  # Create a reactive text
  text <- reactive({
     paste(input$variable, 'versus', input$variable2)
  })
 formulaText <- reactive({
    paste(input$variable, "~", input$variable2)
  })
  # Return as text the selected variables
  output$caption <- renderText({
    text()
  })
 
  # Generate a plot of the requested variables
 output$plot <- renderPlot({
     p <- ggplot(data, aes_string(x=input$variable, y=input$variable2)) + geom_point()
     print(p)
  })
}),

 options = list(height = 600)
  
)

Use the graphs to explore the basic relationships in the data. Make sure to make note of them.

Model Fitting

Recall the following models

fit1 <- lm(infdeath ~ le.female, data=poverty)
fit2 <- lm(infdeath ~ le.male, data=poverty)
fit3 <- lm(infdeath ~ gnp, data=poverty)
fit4 <-  lm(infdeath ~ birthrate, data=poverty)
fit5 <-  lm(infdeath ~ deathrate, data=poverty)
fit6 <- lm(infdeath~log.gnp, data=poverty)
mfit <- lm(infdeath ~ le.female + birthrate  + log.gnp, data=poverty)

Last time we finished off with mfit. We will need to consider the model fit on this now. It will be our goal to use various plots to check the fit of our model and then we will use automated model procedures to see if this model is what R would automatically build.

The following code will create standardized residual plots. Use these to describe what you see about the model.
1. Is the model a correct fit?
2. Does the homogenous variance assumption hold?

library(olsrr)
ols_plot_resid_fit(mfit)

ols_test_score(mfit)

## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.967767 
##  Prob > Chi2   =    0.04637925

ols_test_f(mfit)

## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    1 
##  Den DF     =    89 
##  F          =    4.057476 
##  Prob > F   =    0.04699519

a.The model is a correct fit.

b.From the residual plot and the score test, the homogenous variance assumption is not held.

Create a marginal model plot for mfit. Does this fit our data well?

library(car)
#blue line represents a loess(smoothing) line for the data and the dashed line represents 
#the model which R fitted
mmps(mfit)

Yes, this model fit the data well.

Plot the cooks distance, DFBETAs and DFFITs.

ols_plot_cooksd_bar(mfit)

ols_plot_dfbetas(mfit)

ols_plot_dffits(mfit)

What do you notice?

There are some outliers in this dataset when fitting the model.

We can see that above highlights 6 possible values for outliers. Once doing this run a new model called mfit.ex.outliers which excludes outliers. The code to remove the outliers is as follows:

outliers <- c(18, 41, 48, 55, 76, 79, 90)
poverty2 <- poverty[-outliers,]

mfit.ex.outliers<-lm(infdeath ~ le.female + birthrate + log.gnp, data = poverty2)

library(broom)
tidy.mfit<-tidy(mfit)
tidy.mfit.ex.outlier<-tidy(mfit.ex.outliers)
rbind(tidy.mfit,tidy.mfit.ex.outlier)

##          term     estimate  std.error    statistic      p.value
## 1 (Intercept) 312.20423719 26.3243755  11.85989151 7.242441e-20
## 2   le.female  -3.92846115  0.3556308 -11.04645922 3.054395e-18
## 3   birthrate   0.05135414  0.2422885   0.21195449 8.326383e-01
## 4     log.gnp   0.12803497  1.5903078   0.08050955 9.360169e-01
## 5 (Intercept) 311.21743408 27.6927667  11.23822106 3.945707e-18
## 6   le.female  -3.94982592  0.3760349 -10.50388106 1.015906e-16
## 7   birthrate   0.05908817  0.2534429   0.23314195 8.162467e-01
## 8     log.gnp   0.39238486  1.6805894   0.23348050 8.159847e-01

glance(mfit)

##   r.squared adj.r.squared   sigma statistic      p.value df   logLik
## 1 0.9105986     0.9075158 14.0811  295.3796 1.733962e-45  4 -367.758
##        AIC      BIC deviance df.residual
## 1 745.5159 758.0702 17250.13          87

glance(mfit.ex.outliers)

##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.9058432     0.9023123 14.36912  256.5488 6.175445e-41  4 -341.0085
##       AIC     BIC deviance df.residual
## 1 692.017 704.171 16517.72          80

a. Does the summary of `mfit.ex.outliers` differ much from `mift`?

The summary of mfit.ex.outliers does not differ much from mift.

b. What happens when you run the cooks distance plot on this data with outliers removed?

ols_plot_cooksd_bar(mfit.ex.outliers)

There are still outliers in the model.

c. Use residual plots and `mmps()` to evaluate the model fit after deleting the outliers.

ols_plot_resid_fit(mfit.ex.outliers)

ols_test_score(mfit.ex.outliers)

## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.254808 
##  Prob > Chi2   =    0.03913938

ols_test_f(mfit.ex.outliers)

## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    1 
##  Den DF     =    82 
##  F          =    4.375113 
##  Prob > F   =    0.03956154

mmps(mfit.ex.outliers)

The model still doesn’t meet the homogenous variance assumption. The fitting plots shows that the mfit.ex.outliers does not fit the data better than mfit.

d. Does this new model fit the data better with the outliers removed?

Above shows us that there is more involved than just cook’s distance and removing outliers. We can get stuck in an endless loop of removing things and each time we find that our model fit decreases and we have more and more outliers. We will proceed with the model mfit since our regression results did not change much and our fit got worse as we went.

Fit the following models:

poverty3<-na.omit(poverty)
fit.best.2 <- lm(infdeath ~ le.female + le.male, data=poverty3)
fit.best.3 <- lm(infdeath ~ le.female + le.male + birthrate, data=poverty3)
fit.best.4 <- lm(infdeath ~ le.female + le.male + birthrate + log.gnp, data=poverty3)

Compare the adjusted \(R^2\) values from these models. Which one would you consider the best and why?

glance(fit.best.2)

##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.9106334     0.9086023 13.99814  448.3538 7.107487e-47  3 -367.7403
##        AIC     BIC deviance df.residual
## 1 743.4805 753.524 17243.42          88

glance(fit.best.3)

##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.9106638     0.9075832 14.07597  295.6164 1.679852e-45  4 -367.7248
##        AIC      BIC deviance df.residual
## 1 745.4495 758.0038 17237.55          87

glance(fit.best.4)

##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.9106706     0.9065157 14.15703  219.1822 3.137215e-44  5 -367.7213
##        AIC      BIC deviance df.residual
## 1 747.4426 762.5078 17236.24          86

fit.best.2 is the best since it have the largest adjusted R square.

Use the anova() function in R to preform an \(F\)-test comparing these models. Which model would you choose based on this? Does it confirm your model chosen in question 9?

anova(fit.best.2,fit.best.3,fit.best.4)

## Analysis of Variance Table
## 
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate
## Model 3: infdeath ~ le.female + le.male + birthrate + log.gnp
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     88 17243                           
## 2     87 17238  1    5.8676 0.0293 0.8645
## 3     86 17236  1    1.3125 0.0065 0.9357

anova(fit.best.2,fit.best.3)

## Analysis of Variance Table
## 
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     88 17243                           
## 2     87 17238  1    5.8676 0.0296 0.8638

anova(fit.best.3,fit.best.4)

## Analysis of Variance Table
## 
## Model 1: infdeath ~ le.female + le.male + birthrate
## Model 2: infdeath ~ le.female + le.male + birthrate + log.gnp
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     87 17238                           
## 2     86 17236  1    1.3125 0.0065 0.9357

anova(fit.best.2,fit.best.4)

## Analysis of Variance Table
## 
## Model 1: infdeath ~ le.female + le.male
## Model 2: infdeath ~ le.female + le.male + birthrate + log.gnp
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     88 17243                           
## 2     86 17236  2      7.18 0.0179 0.9823

The anova test shows that the fit.best.3 and fit.best.4 are not significantly better than fit.best.2. Thus, fit.best.2 is the best model, it confirms the model chosen in question 9.

On your own With your final model run the model diagnostic plots again and see how well this fits your data.

ols_plot_resid_fit(fit.best.2)

ols_test_score(fit.best.2)

## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.515315 
##  Prob > Chi2   =    0.06080409

ols_test_f(fit.best.2)

## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of infdeath 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    1 
##  Den DF     =    89 
##  F          =    3.576204 
##  Prob > F   =    0.06186585

mmps(fit.best.2)

The final model hold the homgenous variance assumption. And it fits the data well.