This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 2.
You will notice that this is the same data set as last lab:
poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt",
header=FALSE, sep="")
colnames(poverty) <- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")
#This will print some of the beginning rows of data
head(poverty)
#R thinks gnp is factor(group) data
class(poverty$gnp)
#Replace the * values with NA
poverty[poverty=="*"] <- NA
# Tell R that gnp is numeric data
poverty$gnp <- as.numeric(as.character(poverty$gnp))
#log.gnp
poverty$log.gnp <- log(poverty$gnp)
We should always begin with a data exploration. In order to make this proceed faster in lab, below is an embedded graph that will allow you to explore the relationships of all the variables in the lab.
To get the shiny below to run, replace the header at the top with this:
---
title: "Lab 2 - Multiple Regression and Model Testing"
author: "Your Name"
output:
html_document
runtime: shiny
---
then run the document and your code below will allow you to explore the data more.
library(shiny)
library(ggplot2)
data <- poverty
shinyApp(
# Define UI for iris application
shinyUI(pageWithSidebar(
# Application title
headerPanel(""),
# Sidebar with controls to select the variable to plot against mpg
# and to specify whether outliers should be included
sidebarPanel(
selectInput("variable", "First variable:",
list(
"birthrate", "deathrate", "infdeath", "le.male", "le.female", "gnp", "log.gnp"
)),
selectInput("variable2", "Second variable:",
list(
"birthrate", "deathrate", "infdeath", "le.male", "le.female", "gnp", "log.gnp"
))
),
mainPanel(
h3(textOutput("caption")),
plotOutput("plot")
)
)),
# Define server logic required to plot variables
shinyServer(function(input, output) {
# Create a reactive text
text <- reactive({
paste(input$variable, 'versus', input$variable2)
})
formulaText <- reactive({
paste(input$variable, "~", input$variable2)
})
# Return as text the selected variables
output$caption <- renderText({
text()
})
# Generate a plot of the requested variables
output$plot <- renderPlot({
p <- ggplot(data, aes_string(x=input$variable, y=input$variable2)) + geom_point()
print(p)
})
}),
options = list(height = 600)
)
Recall the following models
fit1 <- lm(infdeath ~ le.female, data=poverty)
fit2 <- lm(infdeath ~ le.male, data=poverty)
fit3 <- lm(infdeath ~ gnp, data=poverty)
fit4 <- lm(infdeath ~ birthrate, data=poverty)
fit5 <- lm(infdeath ~ deathrate, data=poverty)
fit6 <- lm(infdeath~log.gnp, data=poverty)
mfit <- lm(infdeath ~ le.female + birthrate + log.gnp, data=poverty)
Last time we finished off with mfit
. We will need to consider the model fit on this now. It will be our goal to use various plots to check the fit of our model and then we will use automated model procedures to see if this model is what R would automatically build.
library(olsrr)
ols_rvsp_plot(mfit)
ols_score_test(mfit)
ols_f_test(mfit)
mfit
. Does this fit our data well?#blue line represents a loess(smoothing) line for the data and the dashed line represents
#the model which R fitted
mmps(mfit)
mfit
. Does this fit our data well?mfit.ex.outliers
which excludes outliers. The code to remove the outliers is as follows:outliers <- c(18, 41, 55, 76, 79, 90)
poverty2 <- poverty[-outliers,]
a. Does the summary of `mfit.ex.outliers` differ much from `mift`?
b. What happens when you run the cooks distance plot on this data with outliers removed?
c. Use residual plots and `mmps()` to evaluate the model fit after deleting the outliers.
d. Does this new model fit the data better with the outliers removed?
Above shows us that there is more involved than just cook’s distance and removing outliers. We can get stuck in an endless loop of removing things and each time we find that our model fit decreases and we have more and more outliers. We will proceed with the model mfit
since out regression results did not change much and our fit got worse as we went.
fit.best.2 <- lm(infdeath ~ le.female + le.male, data=poverty)
fit.best.3 <- lm(infdeath ~ le.female + le.male + birthrate, data=poverty)
fit.best.4 <- lm(infdeath ~ le.female + le.male + birthrate + log.gnp, data=poverty)
anova()
function in R to preform an \(F\)-test comparing these models. Which model would you choose based on this? Does it confirm your model chosen in question 9?