The R package dplyr provides a collection of "verbs" for manipulating data frames. Each dplyr verb takes a data frame as input and returns a modified version of it. The design philosophy of the package is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.
A tibble is the tidyverse's improved data frame format. dplyr verbs accept both ordinary data frames and tibbles, and most return an object of the same type as their input, so you get tibbles out when you put tibbles in. You can create tibbles explicitly with the tibble function. One convenient feature of the tibble format is that printing shows only the first few rows.
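As a minimal sketch of the tibble function mentioned above, the following builds a small tibble by hand. The object name tb and its columns are made up for illustration.

```r
library(dplyr)  # dplyr re-exports tibble() and as_tibble()

# Hypothetical data, for illustration only
tb <- tibble(id = c("A", "B", "C"), value = c(12, 15, 16))

class(tb)  # a tibble is still a data frame, with extra classes on top
tb         # printing also reports the dimensions and column types
```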
Load the iris data frame from the datasets package and view the first six rows:
library(tidyr); library(dplyr)
data(iris)
head(iris)
Convert iris to a tibble. This step is optional, because dplyr's verbs also work on ordinary data frames such as iris:
tiris <- as_tibble(iris)
tiris
The function filter is the equivalent of subset in base R:
firis <- filter(iris, Sepal.Length > 5.8)
dim(iris); dim(firis)
Data frame iris has 150 rows and data frame firis has 70. Note that inside the filter condition (i.e. Sepal.Length > 5.8) we refer to the column simply as Sepal.Length, not as iris$Sepal.Length.
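To make the comparison with base R concrete, here is a small sketch of the same filter written with subset, followed by a filter with two conditions. The object names biris and viris are illustrative only.

```r
library(dplyr)
data(iris)

# Base R equivalent of filter(iris, Sepal.Length > 5.8)
biris <- subset(iris, Sepal.Length > 5.8)
nrow(biris)

# Conditions separated by commas are combined with a logical "and"
viris <- filter(iris, Sepal.Length > 5.8, Species == "virginica")
nrow(viris)
```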
Rather than filtering the data, we might instead want to sort it. The function arrange sorts a data frame by one or more columns.
Sort by Petal.Length:
airis <- arrange(iris, Petal.Length)
head(airis); tail(airis)
Sort by Petal.Length and Petal.Width:
airis <- arrange(iris, Petal.Length, Petal.Width)
head(airis); tail(airis)
desc() can be used to reverse the sort order:
airis <- arrange(iris, desc(Petal.Length))
head(airis)
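Ascending and descending keys can be mixed in one call. The sketch below sorts by Species in ascending order and, within each species, by Petal.Length in descending order.

```r
library(dplyr)

# Species ascending, then Petal.Length descending within each species
airis <- arrange(iris, Species, desc(Petal.Length))
head(airis)
```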
dplyr's select function can be used for subsetting, renaming, and reordering columns. Columns can be referred to either by name or by position.
Select columns Species, Sepal.Length, and Sepal.Width from data frame iris:
siris <- select(iris, Species, Sepal.Length, Sepal.Width)
head(siris)
The same thing using column numbers instead of column names:
siris <- select(iris, 5, 1, 2)
head(siris)
Change variable names:
siris <- select(iris, spp = Species, slen = Sepal.Length, swid = Sepal.Width)
head(siris)
You can also remove specific columns by prefixing their names with a minus sign:
siris <- select(iris, -Petal.Length, -Petal.Width)
head(siris)
You can also refer to ranges of columns to select or remove:
siris <- select(iris, 1:3)
head(siris)
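Ranges can also be written with column names, and select supports helper functions such as starts_with. The following is a short sketch; the object names s1, s2 and s3 are illustrative only.

```r
library(dplyr)

# A range of columns given by name rather than by position
s1 <- select(iris, Sepal.Length:Petal.Length)
names(s1)

# Remove a range of columns
s2 <- select(iris, -(Sepal.Length:Petal.Length))
names(s2)

# starts_with() is one of the selection helpers usable inside select()
s3 <- select(iris, starts_with("Petal"), Species)
names(s3)
```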
Say we want to convert a factor into a numeric score, so we can produce numerical summaries. The scoring scheme will be:
fspp <- levels(iris$Species)
scores <- tibble(Species = factor(fspp, levels = fspp), score = c(1, 2, 3))
dplyr has several join functions: left_join, right_join, inner_join, full_join and anti_join. They differ in what happens when a row in one data frame has no corresponding row in the other. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame. anti_join is a bit different: it returns the rows of the first data frame that have no match in the second.
Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.
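The differences between the joins are easiest to see on a tiny example. The sketch below uses two made-up data frames, x and y, in which key "C" appears only in x and key "D" appears only in y. All names and values are hypothetical.

```r
library(dplyr)

# Hypothetical data: "C" is only in x, "D" is only in y
x <- tibble(id = c("A", "B", "C"), weight = c(12, 15, 16))
y <- tibble(id = c("A", "B", "D"), breed = c("Merino", "Romney", "Suffolk"))

inner_join(x, y, by = "id")  # A and B only
full_join(x, y, by = "id")   # A, B, C and D, with NA where data are missing
left_join(x, y, by = "id")   # A, B and C (every row of x)
right_join(x, y, by = "id")  # A, B and D (every row of y)
anti_join(x, y, by = "id")   # C only (rows of x with no match in y)
```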
Create a second data frame to be joined to the data frame iris:
att <- data.frame(Species = c("setosa", "versicolor", "virginica"), color = c("red", "green", "blue"))
jiris <- left_join(iris, att, by = "Species")
One important property of all the join functions: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows are included in the result. So, here, each row of att has been copied many times in the output, once for every iris row of that species.
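The same copying behaviour applies to the scoring scheme introduced earlier. As a sketch, assuming the scoring table is stored under the illustrative name scores with a key column called Species, joining it to iris attaches a numeric score to every row, after which numerical summaries become possible.

```r
library(dplyr)

# Scoring table: one row per species (the name "scores" is an assumption)
fspp <- levels(iris$Species)
scores <- tibble(Species = factor(fspp, levels = fspp), score = c(1, 2, 3))

# Each of the 3 rows in scores matches 50 rows of iris
scored <- left_join(iris, scores, by = "Species")
nrow(scored)
table(scored$Species, scored$score)

# With a numeric score attached, numerical summaries are possible
mean(scored$score)
```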
The function mutate lets us add or overwrite columns by computing a new value for them.
miris <- mutate(iris, Petal.comb = Petal.Length * Petal.Width)
This is equivalent to:
miris <- iris
miris$Petal.comb <- iris$Petal.Length * iris$Petal.Width
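A single mutate call can create several columns, refer to columns created earlier in the same call, and overwrite existing ones. The sketch below illustrates all three; the column name Petal.half is made up for the example.

```r
library(dplyr)

miris <- mutate(iris,
  Petal.comb = Petal.Length * Petal.Width,  # new column
  Petal.half = Petal.comb / 2,              # uses the column created just above
  Sepal.Length = Sepal.Length * 10          # overwrites the existing column
)
head(miris)
```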
The function summarize (also spelled summarise; the two are interchangeable) lets us compute summaries of data:
siris <- summarize(iris, Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width), Petal.Length = min(Petal.Length), Petal.Width = min(Petal.Width))
The usual way to use summarize is with group_by:
giris <- group_by(iris, Species)
summarise(giris, Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width), Petal.Length = min(Petal.Length), Petal.Width = min(Petal.Width))
The special function n() can be used within summarize to return the number of rows. This also works in mutate, but is most useful in summarize.
giris <- group_by(iris, Species)
summarise(giris, n = n(), Sepal.Length = mean(Sepal.Length), Sepal.Width = mean(Sepal.Width), Petal.Length = min(Petal.Length), Petal.Width = min(Petal.Width))
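Two related uses of n() are sketched below: count, which is a shortcut for grouping and then summarising with n(), and n() inside mutate, where it gives the size of the group each row belongs to.

```r
library(dplyr)

# count() is a shortcut for group_by() followed by summarise(n = n())
ciris <- count(iris, Species)
ciris

# In mutate(), n() gives the size of each row's group
giris <- group_by(iris, Species)
miris <- mutate(giris, n = n())
head(miris)
```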
We often want to string together a series of dplyr functions. This is done with the pipe operator, %>%, which dplyr re-exports from the magrittr package. It takes the value on the left and passes it as the first argument to the function call on the right.
miris <- iris %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species)
summarise(miris, mean = mean(Sepal.Length))
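To see what the pipe does, it may help to compare a piped call with its unpiped form. The sketch below first checks that x %>% f(y) gives the same result as f(x, y), then writes the pipeline above as nested calls. The object names a, b and niris are illustrative only.

```r
library(dplyr)

# x %>% f(y) is the same as f(x, y)
a <- iris %>% filter(Sepal.Length > 5)
b <- filter(iris, Sepal.Length > 5)
identical(a, b)

# The same pipeline written as nested calls, read from the inside out
niris <- group_by(
  filter(
    select(iris, Sepal.Length, Sepal.Width, Species),
    Sepal.Length > 5
  ),
  Species
)
summarise(niris, mean = mean(Sepal.Length))
```

The nested form gives the same result but has to be read from the innermost call outwards, which is the main readability argument for the pipe.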
Generate a data frame in wide format:
code <- c("AFG", "ALB")
country <- c("Afghanistan", "Albania")
y1950 <- c(20249, 8097)
y1951 <- c(21352, 8986)
y1952 <- c(22532, 10058)
y1953 <- c(23557, 11123)
y1954 <- c(24555, 12246)
dwide <- data.frame(code, country, y1950, y1951, y1952, y1953, y1954)
Wide to long:
dlong <- gather(data = dwide, key = year, value = pop, -c(code, country))
Alternative:
dlong <- dwide %>%
  gather(key = year, value = pop, -c(code, country))
Long to wide:
dwide <- spread(data = dlong, key = year, value = pop)
Alternative:
dwide <- dlong %>%
  spread(key = year, value = pop)
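In tidyr 1.0.0 and later, gather and spread still work but are superseded by pivot_longer and pivot_wider. The following sketch repeats the reshaping with the newer functions on a cut-down version of the data (two year columns only, to keep it short); the object name dwide2 is illustrative.

```r
library(tidyr)

code <- c("AFG", "ALB")
country <- c("Afghanistan", "Albania")
y1951 <- c(21352, 8986)
y1952 <- c(22532, 10058)
dwide <- data.frame(code, country, y1951, y1952)

# Wide to long: every column except code and country becomes a (year, pop) pair
dlong <- pivot_longer(dwide, cols = -c(code, country),
                      names_to = "year", values_to = "pop")
dlong

# Long to wide: one column per distinct value of year
dwide2 <- pivot_wider(dlong, names_from = year, values_from = pop)
dwide2
```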
The iris data frame has 150 rows, with records of sepal width, sepal length, petal width and petal length, by species. Here we select the columns Sepal.Length, Sepal.Width, and Species and then filter the data to include only those records where Sepal.Length > 5:
miris <- iris %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  filter(Sepal.Length > 5)
head(miris)
Do the same, but now group by Species:
miris <- iris %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species)
summarise(miris, mean = mean(Sepal.Length))
A group of lambs were weighed weekly for three weeks. The data (in wide format) is presented to you as follows:
| id | week1 | week2 | week3 |
| A | 12 | 15 | 18 |
| B | 15 | 18 | 21 |
| C | 16 | 19 | 22 |
| D | 14 | 13 | 20 |
id <- c("A", "B", "C", "D")
week1 <- c(12,15,16,14)
week2 <- c(15,18,19,13)
week3 <- c(18,21,22,20)
dwide <- data.frame(id, week1, week2, week3); dwide
There are four lambs (A, B, C, and D), with three weights recorded for each lamb. Convert this data from wide format to long format:
dlong <- gather(data = dwide, key = week, value = weight, -c(id))
dlong
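One reason to prefer the long format is that grouped summaries become straightforward. As a sketch, the code below rebuilds the lamb data, reshapes it, and computes the mean weight per week. The value column name weight and the summary column name mean_weight are assumptions made for this example.

```r
library(dplyr)
library(tidyr)

id <- c("A", "B", "C", "D")
week1 <- c(12, 15, 16, 14)
week2 <- c(15, 18, 19, 13)
week3 <- c(18, 21, 22, 20)
dwide <- data.frame(id, week1, week2, week3)

# Wide to long: one row per lamb per week
dlong <- gather(data = dwide, key = week, value = weight, -id)

# Mean weight of the four lambs in each week
wmeans <- dlong %>%
  group_by(week) %>%
  summarise(mean_weight = mean(weight))
wmeans
```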