Introduction

This lab will be using R and R studio. Please have these on your computer prior to coming to lab. If you would like to work with R Markdown and learn that more, please download the R markdown file that created this, Lab 1

Dataset: Poverty.

You will be using data from the paper The Statistics of Poverty and Inequality. There is the following data from 97 countries:

  1. Eastern Europe
  2. South America and Mexico
  3. Western Europe, North America, Japan, Australia, New Zealand
  4. Middle East
  5. Asia
  6. Africa

You can copy and paste the text below to read the data into R:

poverty<- read.table(file="http://www.amstat.org/publications/jse/datasets/poverty.dat.txt", header=FALSE, sep="")

When you open this data you will see that there are no column names to show what these variables are. We can go ahead and add these with the following command:

library(tidyverse)
  colnames(poverty)<- c("birthrate", "deathrate", "infdeath", "le.male", "le.female",
"gnp", "group", "country")

#This will print some of the beginning rows of data
head(poverty)
##   birthrate deathrate infdeath le.male le.female  gnp group
## 1      24.7       5.7     30.8    69.6      75.5  600     1
## 2      12.5      11.9     14.4    68.3      74.7 2250     1
## 3      13.4      11.7     11.3    71.8      77.7 2980     1
## 4      12.0      12.4      7.6    69.8      75.9    *     1
## 5      11.6      13.4     14.8    65.4      73.8 2780     1
## 6      14.3      10.2     16.0    67.2      75.7 1690     1
##             country
## 1           Albania
## 2          Bulgaria
## 3    Czechoslovakia
## 4 Former_E._Germany
## 5           Hungary
## 6            Poland

If we start to look at the data we realize that when the authors of this data set had missing data they used an * symbol to represent missing data. R does not handle this kind of data so we will fix this. We can also see that gnp is not being treated as a numeric variable and need to correct this.

library(tidyverse)
#R thinks gnp is factor(group) data
class(poverty$gnp)


#Replace the * values with NA
poverty[poverty=="*"] <- NA

# Tell R that gnp is numeric data 
poverty$gnp <- as.numeric(as.character(poverty$gnp))

#log.gnp most economic models use log.gnp instead of gnp
poverty$log.gnp <- log(poverty$gnp)
## [1] "factor"

The above few steps allow us to load our data into R and do a little cleaning. We want you to do these steps so you can grow and gain experience with real data.

Looking at Data

It is always a good idea to take an informal look at the data before model fitting. First take the time to start by exploring the data using various graphs

library(ggplot2)
#Histogram
ggplot(poverty, aes(birthrate)) + geom_histogram()

#Boxplots
ggplot(poverty, aes(x= "", y=le.female)) + geom_boxplot() + xlab("")

#Scatter Plots
ggplot(poverty, aes(x=le.female, y=infdeath)) + geom_point()

Take some time to explore each this data and perform some basic number summaries. For example:

#install.packages("stargazer")
library(tidyverse)
library(stargazer)
poverty %>%
  stargazer(., type="html", title="Descriptive statistics", digits=2)
Descriptive statistics
Statistic N Mean St. Dev. Min Max
birthrate 97 29.23 13.55 9.70 52.20
deathrate 97 10.84 4.65 2.20 25.00
infdeath 97 54.90 45.99 4.50 181.60
le.male 97 61.49 9.62 38.10 75.90
le.female 97 66.15 11.01 41.20 81.80
gnp 91 5,741.25 8,093.68 80 34,064
group 97 3.95 1.74 1 6
log.gnp 91 7.51 1.64 4.38 10.44

For more on stargazer, here is a cheatsheet

Fitting a Regression Model

We will begin exploring the modeling potential of this data by considering the relationship between Life Expectancy of Females and Infant Death Rates. Using the lm() function we will regress the infant death rates onto life expectancy of females

library(broom)
fit1 <- lm(infdeath ~ le.female, data=poverty)
tidy(fit1, conf.int=T)[,-c(3:4)]
##          term   estimate      p.value   conf.low  conf.high
## 1 (Intercept) 319.009847 8.510101e-59 302.145999 335.873694
## 2   le.female  -3.992506 4.268595e-52  -4.244014  -3.740999
  1. Create a scatter plot of infant death rates and life expectancy of females to see what this relationship looks like. Describe this relationship.
ggplot(poverty,aes(y = infdeath, x = le.female)) + geom_point() + geom_smooth()

#linear 
  1. What is the slope of this regression line? What is the intercept?
library(broom)
fit1 <- lm(infdeath ~ le.female, data=poverty)
tidy(fit1, conf.int=T)[,-c(3:4)]
##          term   estimate      p.value   conf.low  conf.high
## 1 (Intercept) 319.009847 8.510101e-59 302.145999 335.873694
## 2   le.female  -3.992506 4.268595e-52  -4.244014  -3.740999
  1. Perform a t-test for significance of the slope. What are your findings?

The p-value of the t-test for le.female is less than 0.05, which means le.female is significant.

  1. Interpret the intercept of this model. Does this have meaning in our problem?

The mean infant death rates is 319.00 when the life expectancy is zero. This would have no meaning in our problem.

  1. Interpret the slope of this model. What conclusions might you draw from this?

The infant death rates will decrease with 3.9925 if the life expentancy of female increase with 1 unit.

  1. Use the confint() function to find the confidence intervals of the coefficients of the the model.
confint(fit1)
##                  2.5 %     97.5 %
## (Intercept) 302.145999 335.873694
## le.female    -4.244014  -3.740999
  1. Using these confidence intervals would you make the same conclusion as the t-test performed in question 3? Yes. No zero included in the intervals

Testing Further Relationships

  1. Quickly go through questions 1-7 for other relationships (On Your Own)
#infant death rate on log gnp
ggplot(poverty,aes(y = infdeath, x = log.gnp)) + geom_point() + geom_smooth()

fit2 <- lm(infdeath ~ log.gnp, data=poverty)
tidy(fit2, conf.int=T)[,-c(3:4)]
confint(fit2)
##          term estimate      p.value  conf.low conf.high
## 1 (Intercept) 221.1297 4.472534e-27 192.84267 249.41672
## 2     log.gnp -22.0789 3.829644e-20 -25.75845 -18.39934
##                 2.5 %    97.5 %
## (Intercept) 192.84267 249.41672
## log.gnp     -25.75845 -18.39934
#infant death rate on birthrate
ggplot(poverty,aes(y = infdeath, x = birthrate)) + geom_point() + geom_smooth()

fit3 <- lm(infdeath ~ birthrate, data=poverty)
tidy(fit3, conf.int=T)[,-c(3:4)]
confint(fit3)
##          term   estimate      p.value   conf.low  conf.high
## 1 (Intercept) -30.280978 8.723632e-07 -41.700734 -18.861222
## 2   birthrate   2.914208 2.770565e-29   2.559415   3.269002
##                  2.5 %     97.5 %
## (Intercept) -41.700734 -18.861222
## birthrate     2.559415   3.269002

Multiple Regressions

  1. Build a multiple linear regression model predicting infdeath with the following variables:
fit4 <- lm(infdeath ~ le.female + birthrate + log.gnp, data = poverty)
  1. What is the adjusted \(R^2\) for this model. What does this number tell you?
summary(fit4)$adj.r.squared
## [1] 0.9075158

The adjusted R square is relatively high, which means that the model fits well.

  1. How does the combined adjusted \(R^2\) model compare to the single variable model \(R^2\)’s?
summary(fit4)$adj.r.squared
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared
summary(fit3)$adj.r.squared
## [1] 0.9075158
## [1] 0.9117778
## [1] 0.610644
## [1] 0.7339998

fit1 has the largest adjusted R square, which means the model with only \(le.female\) fits the data best.