Introduction

This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 2.

You will notice that this is the same data set as last lab:

poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt",
                 header=FALSE, sep="")

colnames(poverty) <- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")

#This will print some of the beginning rows of data
head(poverty)

#R thinks gnp is factor(group) data
class(poverty$gnp)

#Replace the * values with NA
poverty[poverty=="*"] <- NA

# Tell R that gnp is numeric data 
poverty$gnp <- as.numeric(as.character(poverty$gnp))

#log.gnp
poverty$log.gnp <- log(poverty$gnp)

Data Exploration

We should always begin with a data exploration. In order to make this proceed faster in lab, below is an embedded graph that will allow you to explore the relationships of all the variables in the lab.

To get the shiny below to run, replace the header at the top with this:

---
title: "Lab 2 - Multiple Regression and Model Testing"
author: "Your Name"
output: 
  html_document
runtime: shiny
---

then run the document and your code below will allow you to explore the data more.

library(shiny)
library(ggplot2)
data <- poverty
shinyApp(
  




 
# Define UI for iris application
shinyUI(pageWithSidebar(
 
  # Application title
  headerPanel(""),
 
  # Sidebar with controls to select the variable to plot against mpg
  # and to specify whether outliers should be included
  sidebarPanel(
    selectInput("variable", "First variable:",
                list(
"birthrate", "deathrate", "infdeath",  "le.male",   "le.female", "gnp",    "log.gnp"
                                          )),
 
    selectInput("variable2", "Second variable:",
                list(
"birthrate", "deathrate", "infdeath",  "le.male",   "le.female", "gnp",    "log.gnp"
                                          ))
  ),


  mainPanel(
     h3(textOutput("caption")),
     plotOutput("plot")
  )
)),


 

 
# Define server logic required to plot variables
shinyServer(function(input, output) {
 
  # Create a reactive text
  text <- reactive({
     paste(input$variable, 'versus', input$variable2)
  })
 formulaText <- reactive({
    paste(input$variable, "~", input$variable2)
  })
  # Return as text the selected variables
  output$caption <- renderText({
    text()
  })
 
  # Generate a plot of the requested variables
 output$plot <- renderPlot({
     p <- ggplot(data, aes_string(x=input$variable, y=input$variable2)) + geom_point()
     print(p)
  })
}),

 options = list(height = 600)
  
)

Use the graphs to explore the basic relationships in the data. Make sure to make note of them.

Model Fitting

Recall the following models

fit1 <- lm(infdeath ~ le.female, data=poverty)
fit2 <- lm(infdeath ~ le.male, data=poverty)
fit3 <- lm(infdeath ~ gnp, data=poverty)
fit4 <-  lm(infdeath ~ birthrate, data=poverty)
fit5 <-  lm(infdeath ~ deathrate, data=poverty)
fit6 <- lm(infdeath~log.gnp, data=poverty)
mfit <- lm(infdeath ~ le.female + birthrate  + log.gnp, data=poverty)

Last time we finished off with mfit. We will need to consider the model fit on this now. It will be our goal to use various plots to check the fit of our model and then we will use automated model procedures to see if this model is what R would automatically build.

The following code will create standardized residual plots. Use these to describe what you see about the model.
1. Is the model a correct fit?
2. Does the homogenous variance assumption hold?

library(olsrr)
ols_rvsp_plot(mfit)
ols_score_test(mfit)
ols_f_test(mfit)

Create a marginal model plot for mfit. Does this fit our data well?

#blue line represents a loess(smoothing) line for the data and the dashed line represents 
#the model which R fitted
mmps(mfit)

Create a marginal model plot for mfit. Does this fit our data well?
Plot the cooks distance, DFBETAs and DFFITs.
What do you notice?
We can see that above highlights 5 possible values for outliers. Once doing this run a new model called mfit.ex.outliers which excludes outliers. The code to remove the outliers is as follows:

outliers <- c(18, 41, 55, 76, 79, 90)
poverty2 <- poverty[-outliers,]

a. Does the summary of `mfit.ex.outliers` differ much from `mift`? 
b. What happens when you run the cooks distance plot on this data with outliers removed?
c. Use residual plots and `mmps()` to evaluate the model fit after deleting the outliers. 
d. Does this new model fit the data better with the outliers removed?

Above shows us that there is more involved than just cook’s distance and removing outliers. We can get stuck in an endless loop of removing things and each time we find that our model fit decreases and we have more and more outliers. We will proceed with the model mfit since out regression results did not change much and our fit got worse as we went.

Fit the following models:

fit.best.2 <- lm(infdeath ~ le.female + le.male, data=poverty)
fit.best.3 <- lm(infdeath ~ le.female + le.male + birthrate, data=poverty)
fit.best.4 <- lm(infdeath ~ le.female + le.male + birthrate + log.gnp, data=poverty)

Compare the adjusted \(R^2\) values from these models. Which one would you consider the best and why?
Use the anova() function in R to preform an \(F\)-test comparing these models. Which model would you choose based on this? Does it confirm your model chosen in question 9?
On your own With your final model run the model diagnostic plots again and see how well this fits your data.