This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 1
Dataset: Poverty.
You will be using data from the paper The Statistics of Poverty and Inequality. There is the following data from 97 countries:
You can copy and paste the text below to read the data into R:
poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt", header=FALSE, sep="")
When you open this data you will see that there are no column names to show what these variables are. We can go ahead and add these with the following command:
library(tidyverse)
colnames(poverty)<- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")
#This will print some of the beginning rows of data
head(poverty)
## birthrate deathrate infdeath le.male le.female gnp group
## 1 24.7 5.7 30.8 69.6 75.5 600 1
## 2 12.5 11.9 14.4 68.3 74.7 2250 1
## 3 13.4 11.7 11.3 71.8 77.7 2980 1
## 4 12.0 12.4 7.6 69.8 75.9 * 1
## 5 11.6 13.4 14.8 65.4 73.8 2780 1
## 6 14.3 10.2 16.0 67.2 75.7 1690 1
## country
## 1 Albania
## 2 Bulgaria
## 3 Czechoslovakia
## 4 Former_E._Germany
## 5 Hungary
## 6 Poland
If we start to look at the data we realize that when the authors of this data set had missing data they used an * symbol to represent missing data. R does not handle this kind of data so we will fix this. We can also see that gnp is not being treated as a numeric variable and need to correct this.
library(tidyverse)
#R thinks gnp is factor(group) data
class(poverty$gnp)
#Replace the * values with NA
poverty[poverty=="*"] <- NA
# Tell R that gnp is numeric data
poverty$gnp <- as.numeric(as.character(poverty$gnp))
#log.gnp most economic models use log.gnp instead of gnp
poverty$log.gnp <- log(poverty$gnp)
## [1] "factor"
The above few steps allow us to load our data into R and do a little cleaning. We want you to do these steps so you can grow and gain experience with real data.
It is always a good idea to take an informal look at the data before model fitting. First take the time to start by exploring the data using various graphs
library(ggplot2)
#Histogram
ggplot(poverty, aes(birthrate)) + geom_histogram()
#Boxplots
ggplot(poverty, aes(x= "", y=le.female)) + geom_boxplot() + xlab("")
#Scatter Plots
ggplot(poverty, aes(x=le.female, y=infdeath)) + geom_point()
Take some time to explore each this data and perform some basic number summaries. For example:
#install.packages("stargazer")
library(tidyverse)
library(stargazer)
poverty %>%
stargazer(., type="html", title="Descriptive statistics", digits=2)
Statistic | N | Mean | St. Dev. | Min | Max |
birthrate | 97 | 29.23 | 13.55 | 9.70 | 52.20 |
deathrate | 97 | 10.84 | 4.65 | 2.20 | 25.00 |
infdeath | 97 | 54.90 | 45.99 | 4.50 | 181.60 |
le.male | 97 | 61.49 | 9.62 | 38.10 | 75.90 |
le.female | 97 | 66.15 | 11.01 | 41.20 | 81.80 |
gnp | 91 | 5,741.25 | 8,093.68 | 80 | 34,064 |
group | 97 | 3.95 | 1.74 | 1 | 6 |
log.gnp | 91 | 7.51 | 1.64 | 4.38 | 10.44 |
For more on stargazer, here is a cheatsheet
We will begin exploring the modeling potential of this data by considering the relationship between Life Expectancy of Females and Infant Death Rates. Using the lm() function we will regress the infant death rates onto life expectancy of females
library(broom)
fit1 <- lm(infdeath ~ le.female, data=poverty)
tidy(fit1, conf.int=T)[,-c(3:4)]
## term estimate p.value conf.low conf.high
## 1 (Intercept) 319.009847 8.510101e-59 302.145999 335.873694
## 2 le.female -3.992506 4.268595e-52 -4.244014 -3.740999
Create a scatter plot of infant death rates and life expectancy of females to see what this relationship looks like. Describe this relationship.
What is the slope of this regression line? What is the intercept?
Perform a t-test for significance of the slope. What are your findings?
Interpret the intercept of this model. Does this have meaning in our problem?
Interpret the slope of this model. What conclusions might you draw from this?
Use the confint() function to find the confidence intervals of the coefficients of the the model.
Using these confidence intervals would you make the same conclusion as the t-test performed in question 3?
infdeath
with the following variables:le.female
birthrate
log.gnp
What is the adjusted \(R^2\) for this model. What does this number tell you?
How does the combined adjusted \(R^2\) model compare to the single variable model \(R^2\)’s?