Transforming Data with `dplyr`

Adam J Sullivan
Assistant Professor of Biostatistics
Brown University

`tidyr`

The principles of tidy data provide a standard way to organize data values within a dataset.

--Hadley Wickham (2014)

The `spread()` Function

The first tidyr function we will look into is the spread() function.
With spread() it does similar to what you would expect.
We have a data frame where some of the rows contain information that is really a variable name.
This means the columns are a combination of variable names as well as some data.

What does it do?

The picture below displays this:

Consider this Data:

## # A tibble: 12 x 4
##    country      year key             value
##    <fct>       <int> <fct>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

`key` column

Notice that in the column of key, instead of there being values we see the following variable names:
- cases
- population

What Should we see?

In order to use this data we need to have it so the data frame looks like this instead:

## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Using the `spread()` Function

Now we can see that we have all the columns representing the variables we are interested in and each of the rows is now a complete observation.
In order to do this we need to learn about the spread() function:

The `spread()` Function

spread(data, key, value)

Where
- data is your dataframe of interest.
- key is the column whose values will become variable names.
- value is the column where values will fill in under the new variables created from key.

Piping

If we consider piping, we can write this as:

data %>%
  spread(key, value)

`spread()` Example

Now if we consider table2 , we can see that we have:

## # A tibble: 12 x 4
##    country      year key             value
##    <fct>       <int> <fct>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

`spread()` Example

Now this table was made for this example so key is the key in our spread() function and value is the value in our spread() function.
We can fix this with the following code:

`spread()` Example

table2 %>%
    spread(key,value)

## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

`spread()` Example

We can now see that we have a variable named cases and a variable named population.
This is much more tidy.

On Your Own: RStudio Practice

In this example we will use the dataset population that is part of tidyverse.
Print this data:

## # A tibble: 1 x 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073

On Your Own: RStudio Practice

You should see the table that we have above, now We have a variable named year, assume that we wish to actually have each year as its own variable.
Using the spread() function, redo this data so that each year is a variable.
Your data will look like this at the end:

On Your Own: RStudio Practice

## # A tibble: 219 x 20
##    country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003`
##    <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1 Afghan~ 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7 2.31e7
##  2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6 3.24e6
##  3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7 3.30e7
##  4 Americ~ 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4 5.91e4
##  5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4 7.56e4
##  6 Angola  1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7 1.54e7
##  7 Anguil~ 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4 1.20e4
##  8 Antigu~ 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4 8.09e4
##  9 Argent~ 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7 3.80e7
## 10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6 3.04e6
## # ... with 209 more rows, and 10 more variables: `2004` <int>,
## #   `2005` <int>, `2006` <int>, `2007` <int>, `2008` <int>, `2009` <int>,
## #   `2010` <int>, `2011` <int>, `2012` <int>, `2013` <int>

The `gather()` Function

The second tidyr function we will look into is the gather() function.
With gather() it may not be clear what exactly is going on, but in this case we actually have a lot of column names the represent what we would like to have as data values.

The `gather()` Function Example

For example, in the last spread() practice you created a data frame where variable names were individual years.
This may not be what you want to have so you can use the gather function.

Consider `table4`:

## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <fct>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Table 4

This looks similar to the table you created in the spread() practice.
We now wish to change this data frame so that year is a variable and 1999 and 2000 become values instead of variables.

In Comes the `gather()` Function

We will accomplish this with the gather function:

gather(data, key, value, ...)

where
- data is the data frame you are working with.
- key is the name of the key column to create.
- value is the name of the value column to create.
- ... is a way to specify what columns to gather from.

`gather()` Example

In our example here we would do the following:

table4 %>%
    gather("year", "cases", 2:3)

## # A tibble: 6 x 3
##   country     year   cases
##   <fct>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

`gather()` Example

You can see that we have created 2 new columns called year and cases.
We filled these with the previous 2nd and 3rd columns.
Note that we could have done this in many different ways too.

`gather()` Example

For example if we knew the years but not which columns we could do this:

table4 %>%
    gather("year", "cases", "1999":"2000")

We could also see that we want to gather all columns except the first so we could have used:

table4 %>%
    gather("year", "cases", -1)

On Your Own: RStudio Practice

Create population2 from last example:

population 2 <- population %>% 
                    spread(year, population)

Now gather the columns that are labeled by year and create columns year and population.

On Your Own: RStudio Practice

In the end your data frame should look like:

## # A tibble: 2 x 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073
## 2 Afghanistan  1996   18415307

The `dplyr` Package

Now that we have started to tidy up our data we can see that we have a need to transform this data.
We may wish to add additional variables.
The dplyr package allows us to further work with our data.

`dplyr` Functionality

With dplyr we have five basic verbs that we will learn to work with:
- filter()
- select()
- arrange()
- mutate()
- summarize()

Filtering

At this point we will consider how we pick the rows of the data that we wish to work with.
If you consider many modern data sets, we have so much information that we may not want to bring it all in at once.
R brings data into the RAM of your computer. This means you can be limited for what size data you can bring in at once.
Very rarely do you need the entire data set.
We will focus on how to pick the rows or observations we want now.

Enter the `filter()` Function

The filter() function chooses rows that meet a specific criteria.
We can do this with Base R functions or with dplyr.

library(dplyr)

Filtering Example

Let's say that we want to look at the data just for the country of Kenya in 2000
We could do this without learning a new command and use indexing which we learned yesterday.

gapminder[gapminder$country=="Kenya" & gapminder$year==2002, ]

## # A tibble: 1 x 6
##   country continent  year lifeExp      pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Kenya   Africa     2002    51.0 31386842     1288.

Filtering Example

Now this is not very difficult to do, what we have is that we are working with gapminder and we only want to keep the rows of data there country=="Kenya and year==2002.
However we could use the filter() function to do this in a much easier to read format:

`filter()` Function

filter(.data, ...)

where

.data is a tibble.
... is a set of arguments the data you want returned needs to meet.

Filtering Example

This means in our example we could perform the following:

gapminder %>%
    filter(country=="Kenya", year==2002)

Finally we could also only do one filtering at a time and chain it:

gapminder %>%
    filter(country=="Kenya") %>%
    filter(year==2002)

Further Filtering

filter() supports the use of multiple conditions where we can use Boolean.
For example if we wanted to consider only observations that have a life expectancy between 49 and 60, we could do the following:

gapminder %>% filter(lifeExp>=49 & lifeExp<60)

Further Filtering

## # A tibble: 373 x 6
##    country    continent  year lifeExp      pop gdpPercap
##    <fct>      <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Albania    Europe     1952    55.2  1282697     1601.
##  2 Albania    Europe     1957    59.3  1476505     1942.
##  3 Algeria    Africa     1967    51.4 12760499     3247.
##  4 Algeria    Africa     1972    54.5 14760787     4183.
##  5 Algeria    Africa     1977    58.0 17152804     4910.
##  6 Bahrain    Asia       1952    50.9   120447     9867.
##  7 Bahrain    Asia       1957    53.8   138655    11636.
##  8 Bahrain    Asia       1962    56.9   171863    12753.
##  9 Bahrain    Asia       1967    59.9   202182    14805.
## 10 Bangladesh Asia       1982    50.0 93074406      677.
## # ... with 363 more rows

Further Filtering

We can also use the filter() function to remove missing data for us.
Previously we learned about a class of functions called is.foo() where foo represents a data type.
We could choose to only use observations that have life expectancy.
That means we wish to not have missing data for life expectancy.

gapminder %>% filter(!is.na(lifeExp))

On Your Own: RStudio Practice

Using the filter() function and chaining:

Choose only rows associated with
- Uganda
- Rwanda

On Your Own: RStudio Practice

Your end result should be:

## # A tibble: 24 x 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda  Africa     1952    40   2534927      493.
##  2 Rwanda  Africa     1957    41.5 2822082      540.
##  3 Rwanda  Africa     1962    43   3051242      597.
##  4 Rwanda  Africa     1967    44.1 3451079      511.
##  5 Rwanda  Africa     1972    44.6 3992121      591.
##  6 Rwanda  Africa     1977    45   4657072      670.
##  7 Rwanda  Africa     1982    46.2 5507565      882.
##  8 Rwanda  Africa     1987    44.0 6349365      848.
##  9 Rwanda  Africa     1992    23.6 7290203      737.
## 10 Rwanda  Africa     1997    36.1 7212583      590.
## # ... with 14 more rows

Selecting

The next logical step would be to select the columns we want as well.
Many times we have so many columns that we are no interested in for a particular analysis.
Instead of slowing down your analysis by continuing to run through extra data, we could just select the columns we care about.

Enter the `select()` Function

The select() function chooses columns that we specify.
Again we can do this with base functions or with dplyr.
We feel that as you continue on with your R usage that you will most likely want to go the route of dplyr functions instead.

Select Example

Let's say that we want to look at the gapminder data but we are really only interested in the country, life expectancy and year.
This seems reasonable if we are a customer and wanted to only know these pieces of information. We could do this with indexing :

gapminder[, c("country", "lifeExp", "year")]

`select()` Function

select(.data, ...)

where

.data is a tibble.
... are the columns that you wish to have in bare (no quotations)

Selecting Example Continued

We could then do the following

gapminder %>%
  select(country, lifeExp, year)

Selecting Example Continued

# A tibble: 1,704 x 3
       country lifeExp  year
        <fctr>   <dbl> <int>
 1 Afghanistan  28.801  1952
 2 Afghanistan  30.332  1957
 3 Afghanistan  31.997  1962
 4 Afghanistan  34.020  1967
 5 Afghanistan  36.088  1972
 6 Afghanistan  38.438  1977
 7 Afghanistan  39.854  1982
 8 Afghanistan  40.822  1987
 9 Afghanistan  41.674  1992
10 Afghanistan  41.763  1997
# ... with 1,694 more rows

Removing Columns

We may wish to pick certain columns that we wish to have but we also may want to remove certain columns.
It is quite common to de-identify a dataset before actually distributing it to a research team.
The select() function will also remove columns.

Removing Columns

Lets say that we wished to remove the gdpPercap and pop of the countries:

gapminder %>%
  select(-gdpPercap,-pop)

Removing Columns

We also could use a vector for this:

cols <- c("gdpPercap", "pop")
gapminder %>%
  select(-one_of(cols))

Removing Columns

We can also remove columns that contain a certain phrase in the name.
If we were interested in removing any columns that had to do with time we could search for the phrase "co" in the data and remove them:

gapminder %>%
  select(-matches("co"))

Removing Columns

## # A tibble: 1,704 x 4
##     year lifeExp      pop gdpPercap
##    <int>   <dbl>    <int>     <dbl>
##  1  1952    28.8  8425333      779.
##  2  1957    30.3  9240934      821.
##  3  1962    32.0 10267083      853.
##  4  1967    34.0 11537966      836.
##  5  1972    36.1 13079460      740.
##  6  1977    38.4 14880372      786.
##  7  1982    39.9 12881816      978.
##  8  1987    40.8 13867957      852.
##  9  1992    41.7 16317921      649.
## 10  1997    41.8 22227415      635.
## # ... with 1,694 more rows

Unique Observations

Many times we have a lot of repeats in our data.
If we just would like to have an account of all things included then we can use the unique() command.
Lets assume that we wish to know which countries are in the data.
We do not want to have every country listed over and over again so we ask for unique values:

gapminder %>% 
  select(country) %>% 
  unique()

On Your Own: RStudio Practice

Consider the gapminder data: gapminder.
1. Select all but the pop column.
2. Remove the year and lifeExp from them.
3. Select values which contain "p" in them.
4. Chain these together so that you run a command and it does all of these things.

On Your Own: RStudio Practice

Your answer should look like:

## # A tibble: 1,704 x 1
##    gdpPercap
##        <dbl>
##  1      779.
##  2      821.
##  3      853.
##  4      836.
##  5      740.
##  6      786.
##  7      978.
##  8      852.
##  9      649.
## 10      635.
## # ... with 1,694 more rows

Arranging the Data

We also have need to make sure the data is ordered in a certain manner. This can be easily done in R with the arrange() function.
Again we can do this in base R but this is not always a clear path.

Arranging the Data Example

Let's say that we wish to look countries, year and life expectancy.
Thensay further we want to sort it by Life Expactancy.
In base R we would have to run the following command:

library(gapminder)
library(tidyverse)
gapminder[order(gapminder$lifeExp), c("country", "year", "lifeExp")]

Enter the `arrange()` Function

We could do this in an easy manner using the arrange() function:
```
arrange(.data, ...)
```
Where
- .data is a data frame of interest.
- ... are the variables you wish to sort by.

Arranging the Data Example Continued

gapminder %>%
    select(country,year,  lifeExp) %>%
    arrange(lifeExp)

Arranging the Data Example Continued

## # A tibble: 1,704 x 3
##    country       year lifeExp
##    <fct>        <int>   <dbl>
##  1 Rwanda        1992    23.6
##  2 Afghanistan   1952    28.8
##  3 Gambia        1952    30  
##  4 Angola        1952    30.0
##  5 Sierra Leone  1952    30.3
##  6 Afghanistan   1957    30.3
##  7 Cambodia      1977    31.2
##  8 Mozambique    1952    31.3
##  9 Sierra Leone  1957    31.6
## 10 Burkina Faso  1952    32.0
## # ... with 1,694 more rows

Arranging the Data Example Continued

With arrange() we first use select() to pick the only columns that we want and then we arrange by the lifeExp.
If we had wished to order them in a descending manner we could have simply used the desc() function:

gapminder %>%
    select(country,year,  lifeExp) %>%
    arrange(desc(lifeExp))

More Complex Arrange

Lets consider that we wish to look at the top 3 Life Expectancies for each year.
Then we wish to order them from largest to smallest Life Expectancy.
We then need to do the following:
1. Group by year.
2. Pick the top 3 life expectancy
3. order them largest to smallest

More Complex Arrange Continued

gapminder %>% 
  group_by(year) %>%  
  top_n(3, lifeExp) %>% 
  arrange(desc(lifeExp))

Where
- group_by() is a way to group data. This way we perform operations on a group. So top 3 life expectancy are grouped by year.
- top_n()takes a tibble and returns a specific number of rows based on a chosen value.

More Complex Arrange Continued

## # A tibble: 36 x 6
## # Groups:   year [12]
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  6 Japan            Asia       1997    80.7 125956499    28817.
##  7 Switzerland      Europe     2002    80.6   7361757    34481.
##  8 Hong Kong, China Asia       1997    80     6495918    28378.
##  9 Sweden           Europe     1997    79.4   8897619    25267.
## 10 Japan            Asia       1992    79.4 124329269    26825.
## # ... with 26 more rows

On Your Own: RStudio Practice

Perform the following operations:
- Group by year.
- use sample_n() to pick 1 observation per year
- Arrange by longest to smallest life expectancy.

On Your Own: RStudio Practice

Your answer may look like:

gapminder %>%
  group_by(year) %>%
  sample_n(1) %>%
  arrange(desc(lifeExp))

Summarizing Data

As you have seen in your own work, being able to summarize information is crucial.
We need to be able to take out data and summarize it as well.
We will consider doing this using the summarise() function.

Transforming Data with `dplyr`

tidyr

The spread() Function

The spread() Function

What does it do?

Consider this Data:

key column

What Should we see?

Using the spread() Function

The spread() Function

Piping

spread() Example

spread() Example

spread() Example

spread() Example

On Your Own: RStudio Practice

On Your Own: RStudio Practice

On Your Own: RStudio Practice

The gather() Function

The gather() Function

The gather() Function Example

Consider table4:

Table 4

In Comes the gather() Function

gather() Example

gather() Example

gather() Example

On Your Own: RStudio Practice

On Your Own: RStudio Practice

The dplyr Package

The dplyr Package

dplyr Functionality

Filtering

Filtering

Enter the filter() Function

Filtering Example

Filtering Example

filter() Function

Filtering Example

Further Filtering

Further Filtering

Further Filtering

On Your Own: RStudio Practice

On Your Own: RStudio Practice

Selecting

Selecting

Enter the select() Function

Select Example

select() Function

Selecting Example Continued

Selecting Example Continued

Removing Columns

Removing Columns

Removing Columns

Removing Columns

Removing Columns

Unique Observations

On Your Own: RStudio Practice

On Your Own: RStudio Practice

Arranging the Data

Arranging the Data

Arranging the Data Example

Enter the arrange() Function

Arranging the Data Example Continued

Arranging the Data Example Continued

Arranging the Data Example Continued

More Complex Arrange

More Complex Arrange Continued

More Complex Arrange Continued

On Your Own: RStudio Practice

On Your Own: RStudio Practice

Summarizing Data

Summarizing Data

`tidyr`

The `spread()` Function

The `spread()` Function

`key` column

Using the `spread()` Function

The `spread()` Function

`spread()` Example

`spread()` Example

`spread()` Example

`spread()` Example

The `gather()` Function

The `gather()` Function

The `gather()` Function Example

Consider `table4`:

In Comes the `gather()` Function

`gather()` Example

`gather()` Example

`gather()` Example

The `dplyr` Package

The `dplyr` Package

`dplyr` Functionality

Enter the `filter()` Function

`filter()` Function

Enter the `select()` Function

`select()` Function

Enter the `arrange()` Function