This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 1
Dataset: Poverty.
You will be using data from the paper The Statistics of Poverty and Inequality. There is the following data from 97 countries:
You can copy and paste the text below to read the data into R:
poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt", header=FALSE, sep="")
When you open this data you will see that there are no column names to show what these variables are. We can go ahead and add these with the following command:
library(tidyverse)
colnames(poverty)<- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")
#This will print some of the beginning rows of data
head(poverty)
## birthrate deathrate infdeath le.male le.female gnp group
## 1 24.7 5.7 30.8 69.6 75.5 600 1
## 2 12.5 11.9 14.4 68.3 74.7 2250 1
## 3 13.4 11.7 11.3 71.8 77.7 2980 1
## 4 12.0 12.4 7.6 69.8 75.9 * 1
## 5 11.6 13.4 14.8 65.4 73.8 2780 1
## 6 14.3 10.2 16.0 67.2 75.7 1690 1
## country
## 1 Albania
## 2 Bulgaria
## 3 Czechoslovakia
## 4 Former_E._Germany
## 5 Hungary
## 6 Poland
If we start to look at the data we realize that when the authors of this data set had missing data they used an * symbol to represent missing data. R does not handle this kind of data so we will fix this. We can also see that gnp is not being treated as a numeric variable and need to correct this.
library(tidyverse)
#R thinks gnp is factor(group) data
class(poverty$gnp)
#Replace the * values with NA
poverty[poverty=="*"] <- NA
# Tell R that gnp is numeric data
poverty$gnp <- as.numeric(as.character(poverty$gnp))
#log.gnp most economic models use log.gnp instead of gnp
poverty$log.gnp <- log(poverty$gnp)
## [1] "factor"
The above few steps allow us to load our data into R and do a little cleaning. We want you to do these steps so you can grow and gain experience with real data.
It is always a good idea to take an informal look at the data before model fitting. First take the time to start by exploring the data using various graphs
library(ggplot2)
#Histogram
ggplot(poverty, aes(birthrate)) + geom_histogram()
#Boxplots
ggplot(poverty, aes(x= "", y=le.female)) + geom_boxplot() + xlab("")
#Scatter Plots
ggplot(poverty, aes(x=le.female, y=infdeath)) + geom_point()
Take some time to explore each this data and perform some basic number summaries. For example:
#install.packages("stargazer")
library(tidyverse)
library(stargazer)
poverty %>%
stargazer(., type="html", title="Descriptive statistics", digits=2)
Statistic | N | Mean | St. Dev. | Min | Max |
birthrate | 97 | 29.23 | 13.55 | 9.70 | 52.20 |
deathrate | 97 | 10.84 | 4.65 | 2.20 | 25.00 |
infdeath | 97 | 54.90 | 45.99 | 4.50 | 181.60 |
le.male | 97 | 61.49 | 9.62 | 38.10 | 75.90 |
le.female | 97 | 66.15 | 11.01 | 41.20 | 81.80 |
gnp | 91 | 5,741.25 | 8,093.68 | 80 | 34,064 |
group | 97 | 3.95 | 1.74 | 1 | 6 |
log.gnp | 91 | 7.51 | 1.64 | 4.38 | 10.44 |
For more on stargazer, here is a cheatsheet
We will begin exploring the modeling potential of this data by considering the relationship between Life Expectancy of Females and Infant Death Rates. Using the lm() function we will regress the infant death rates onto life expectancy of females
library(broom)
fit1 <- lm(infdeath ~ le.female, data=poverty)
tidy(fit1, conf.int=T)[,-c(3:4)]
## term estimate p.value conf.low conf.high
## 1 (Intercept) 319.009847 8.510101e-59 302.145999 335.873694
## 2 le.female -3.992506 4.268595e-52 -4.244014 -3.740999
ggplot(poverty,aes(y = infdeath, x = le.female)) + geom_point() + geom_smooth()
#linear
library(broom)
fit1 <- lm(infdeath ~ le.female, data=poverty)
tidy(fit1, conf.int=T)[,-c(3:4)]
## term estimate p.value conf.low conf.high
## 1 (Intercept) 319.009847 8.510101e-59 302.145999 335.873694
## 2 le.female -3.992506 4.268595e-52 -4.244014 -3.740999
The p-value of the t-test for le.female
is less than 0.05, which means le.female
is significant.
The mean infant death rates is 319.00 when the life expectancy is zero. This would have no meaning in our problem.
The infant death rates will decrease with 3.9925 if the life expentancy of female increase with 1 unit.
confint(fit1)
## 2.5 % 97.5 %
## (Intercept) 302.145999 335.873694
## le.female -4.244014 -3.740999
#infant death rate on log gnp
ggplot(poverty,aes(y = infdeath, x = log.gnp)) + geom_point() + geom_smooth()
fit2 <- lm(infdeath ~ log.gnp, data=poverty)
tidy(fit2, conf.int=T)[,-c(3:4)]
confint(fit2)
## term estimate p.value conf.low conf.high
## 1 (Intercept) 221.1297 4.472534e-27 192.84267 249.41672
## 2 log.gnp -22.0789 3.829644e-20 -25.75845 -18.39934
## 2.5 % 97.5 %
## (Intercept) 192.84267 249.41672
## log.gnp -25.75845 -18.39934
#infant death rate on birthrate
ggplot(poverty,aes(y = infdeath, x = birthrate)) + geom_point() + geom_smooth()
fit3 <- lm(infdeath ~ birthrate, data=poverty)
tidy(fit3, conf.int=T)[,-c(3:4)]
confint(fit3)
## term estimate p.value conf.low conf.high
## 1 (Intercept) -30.280978 8.723632e-07 -41.700734 -18.861222
## 2 birthrate 2.914208 2.770565e-29 2.559415 3.269002
## 2.5 % 97.5 %
## (Intercept) -41.700734 -18.861222
## birthrate 2.559415 3.269002
infdeath
with the following variables:le.female
birthrate
log.gnp
fit4 <- lm(infdeath ~ le.female + birthrate + log.gnp, data = poverty)
summary(fit4)$adj.r.squared
## [1] 0.9075158
The adjusted R square is relatively high, which means that the model fits well.
summary(fit4)$adj.r.squared
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared
summary(fit3)$adj.r.squared
## [1] 0.9075158
## [1] 0.9117778
## [1] 0.610644
## [1] 0.7339998
fit1 has the largest adjusted R square, which means the model with only \(le.female\) fits the data best.