Data wrangling

Selecting rows and columns

group <- c(rep("A", times = 50), rep("B", times = 50))
var1 <- rnorm(n = 100, mean = 0, sd = 1)
var2 <- rnorm(n = 100, mean = 25, sd = 5)
dat <- data.frame(group, var1, var2)

Make a new data frame comprised of column 1 (group) and column 3 (var2):

newdat <- dat[,c(1,3)]

Make a new data frame comprised of column 1 (group) and column 3 (var2) and rows 1 to 50:

newdat <- dat[1:50,c(1,3)]

Subsets of data frames can be made using the subset function:

sdat <- subset(dat, subset = (var1 != 0 & var2 <= 25))

Alternative:

id <- dat$var1 != 0 & dat$var2 <= 25
sdat <- data.frame(dat[id,])

Regular patterns

Generate a sequence of numbers from 0 to 10:

seq(from = 0, to = 10, by = 1)

Generate a sequence of numbers from 10 to 0:

seq(from = 10, to = 0, by = -1)

Generate a sequence of 6 numbers from 0 to 10:

seq(from = 0, to = 10, length = 6)

Generate a sequence of numbers from 0 to 10 by 2.5:

seq(from = 0, to = 10, by = 2.5)

Repetition

x = 1:4
rep(x, times = 3)

rep(x, times = c(2,3,3,2))

rep(x, times = 3)

rep(x, times = rep(3, length(x)))

x = "abc"
y = rep(x, times = 3)

Logical values

x = 10 * (1:10)
y = c(rep(1, times = 5), rep(2, times = 5))
x; y

Select those elements of x where y equals 1:

x[y == 1]

Select those elements of x where y equals 2:

x[y == 2]

Subsetting is often used as follows in data analysis:

mean(weight[sex == "male" & age < 30])

This can be read as: return the mean weight for males under age 30.

Lists and matrices

Vectors are the most common objects encountered in R, but sometimes the requirement that all elements must have the same type is too restrictive. There is another structure called a list which can be:

L = list(a = 1:3, b = "hello"); L

Matrices:

A = matrix(1:6, nrow = 3, ncol = 2); A

By default, matrices are created with their values running down successive columns. It is also possible to specify that the matrix be filled by rows.

B = matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE); B

The number of rows and columns in a matrix can be obtained with the functions nrow and ncol. Both values can be obtained simultaneously with the function dim:

A = matrix(1:6, nrow = 3, ncol = 2)
nrow(A); ncol(A); dim(A)

Matrices can be made rather more useful by using row and column labels. A matrix can be labelled as follows:

A = matrix(1:6, nrow = 3)
dimnames(A) = list(c("sex", "drugs", "rock&roll"), c("this", "that"))

Rotate matrix clockwise by 90°:

dat <- matrix(c(1,2,3,4,5,6,7,8), ncol = 2); dat

nrow <- dim(dat)[1]
dat <- t(dat[nrow:1,])
dat

Sorting a data frame

Sort data frame dat in order of variable peid:

dat <- dat[sort.list(dat$peid),]

Multiple sorting:

dat <- dat[order(dat$peid, dat$id),]

Convert a continuous variable into a categorical variable

Here we used the cut function. Note that six cut points results in five levels:

time.lev <- factor(cut(dat$time, breaks = c(0, 182, 364, 547, 729, 1500), labels = c("1","3","5","7","9")))
dat <- data.frame(dat, time.lev)

Now set the contrasts for the factor you've just created. In this example the second level (i.e. time.lev == 3) is set as the baseline (reference) category:

dat$time.lev <- factor(dat$time.lev, levels = c(3,1,5,7,9), labels = c("3","1","5","7","9"))

Manipulating text

You have a column of data called V7 in which some rows contain the text:

day=2

You want to strip the day= so that just the numeric value remains. Use function gsub:

new <- gsub(pattern = "day=", replacement = "", x = V7)

Dates

Create a sequence of dates and re-format from 2001-03-01 to 01-Mar. Calculate the number of days between each element of the generated date sequence and 18 Feb 2001.

date <- seq(from = as.Date("1/3/2001", format = "%d/%m/%Y"), to = as.Date("1/1/2002", format = "%d/%m/%Y"), by = "1 month")
date.labels <- format(date, format = "%d-%b")
days <- as.integer(date - as.Date("18/2/01", format = "%d/%m/%y"))

The objects days and date.labels can be passed to the function axis:

par(cex.axis = 0.80, cex.lab = 0.80)
plot(c(0, 302), c(0, 2500), type = "n", xaxt = "n", xlab = "Date", ylab = "Cumulative number of infected premises")
axis(side = 1, at = days, labels = date.labels, las = 1, srt = 0, tick = TRUE)
legend(x = 180, y = 750, legend = c("Actual", "Predicted"), lwd = c(3,1), col = c(8,1), bty = "n", cex = 0.80)

Calculate age as at today's date and convert to years:

date.today <- Sys.Date()
age <- (date.today - bdays.julian)
age <- trunc(as.numeric(age) / 365.25)
data.frame(birthday = bdays, standard = bdays.julian, julian = as.numeric(bdays.julian), age = age)

Lookup tables

You have a series of unique values listed with an identifier in data frame A. In data frame B you have a list of identifiers (with duplicates) and you wish to record unique value along side the identifier:

names(parasite)
[1] "infection" "age" "weight" sex"

Return the position in the vector called age where age == 1 first appears:

match(1, parasite$age)
[1] 29

Return the sex of the first individual with age == 1:

parasite$sex[match(1, parasite$age)]
[1] male

You have a list of animal identifiers comprised of an irregular sequence of numbers. You want to re-assign these animals with identifiers numbered 1 to n. Make a list of the current identifiers and create a data frame listing the old and the new identifiers:

old <- c(dat$Srcid, dat$Desid)
old <- sort(unique(old))
fix <- as.data.frame(cbind(new = 1:length(old), old))

Re-assign the identifers:

dat$Srcid. <- fix$new[match(dat$Srcid, fix$old)]
dat$Desid. <- fix$new[match(dat$Desid, fix$old)]

Tabulations

grp <- rep(c("A","B","C"), times = 5)
grp[2:3] <- NA

table(factor(grp, exclude = NA))
table(factor(grp, exclude = NaN))

Extract unique records

Create a data frame listing key, xcoord, and ycoord. The variable key may appears multiple times in this data set and we want to remove duplicates:

geo <- dat[,c(1,3,4)]
id <- match(unique(geo$key), geo$key)
geo <- geo[id, 1:3]

Compare two vectors:

farms <- 1:95
sample <- c(1,5,10)
id <- farms %in% sample

File management

Extract data from a ZIP file where the files of interest are within a folder inside the ZIP file. Set the working folder and make a list of files in the working directory:

setwd("D:\\TEMP")
ifname <- list.files(path = getwd())

Process each ZIP file in the working directory:

for(i in 1:length(ifname)){
  # Define the stem of the file do we want to extract:
  stem <- substr(ifname[i], 1, 3)
  ofile <- paste(stem, "/", sep = "")

  # Open zip file and extract contents to working directory:
  unzip(ifname[i], junkpaths = FALSE)
}

Say, for example, we have a list of folders and within each folder there are files we want to copy to a working directory:

setwd("D:\\TEMP")
ifname <- list.files(path = getwd())

Process each folder in the working directory:

for(i in 1:length(ifname)){
  dfrom <- paste("D:\\TEMP\\DEM\\", ifname[i], "\\", sep = "")
  flist <- list.files(path = paste("D:\\TEMP\\DEM\\", ifname[i], sep = ""))
  ffrom <- paste(dfrom, flist, sep = "")

  dto <- "D:\\TEMP\\DEM\\"
  file.copy(from = ffrom, to = dto)
}

Database-like analyses

Data frame customer:

custid name age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sarah 56
5 Susan 18

customer <- data.frame(custid = 1:5, name = c("Joe", "Mika", "Lin", "Sarah", "Susan"), age = c(23,45,34,56,18))

Data frame purchase:

purchid custid purchase
1 3 fiction
2 1 biography
3 1 fiction
4 2 biography
5 3 fiction
6 4 fiction

purchase <- data.frame(purchid = 1:6, custid = c(3,1,1,2,3,4), purchase = c("fiction", "biography", "fiction", "biography", "fiction", "fiction"))

Order by

Sort by one column, preserving row integrity:

customer[order(customer[,2]),]

Returns:

custid name age
1 Joe 23
3 Lin 34
2 Mika 45
4 Sarah 56
5 Susan 18

Where

subset(customer, custid == 5)

Returns:

custid name age
5 Susan 18

Inner join

merge(x = purchase, y = customer, by.x = "custid", by.y = "custid", all = FALSE)

Returns:

custid purchid purchase name age
1 2 biography Joe 23
1 3 fiction Joe 23
2 4 biography Mika 45
3 1 fiction Lin 34
3 5 fiction Lin 34
4 5 fiction Sarah 56

Left outer join

merge(x = purchase, y = customer, by.x = "custid" , by.y = "custid", all = TRUE)

Returns:

custid purchid purchase name age
1 2 biography Joe 23
1 3 fiction Joe 23
2 4 biography Mika 45
3 1 fiction Lin 34
3 5 fiction Lin 34
4 5 fiction Sarah 56
5 NA NA Sarah 56

Group by

custsum <- merge(x = purchase, y = customer, by.x = "custid", by.y = "custid", all = FALSE)
xtabs(~name, custsum)

Returns:

Joe Lin Mika Sarah Susan
2 2 1 1 0

Update

customer[1,]$age <- 99
customer

Returns:

custid name age
1 Joe 99
2 Mika 45
3 Lin 34
4 Sarah 56
5 Susan 18

Insert

newcust <- data.frame(custid = 6, name = "Terry", age = 50)
rbind(newcust, customer)

Returns:

custid name age
6 Terry 50
1 Joe 99
2 Mika 45
3 Lin 34
4 Sarah 56
5 Susan 18

Delete

subset(customer, custid != 1)

Returns:

custid name age
2 Mika 45
3 Lin 34
4 Sarah 56
5 Susan 18