Exploratory data analysis

The object ChickWeight is a data frame included in the datasets package. It comprises 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. There are four variables: (1) weight, a numeric vector giving the body weight of each chick (grams); (2) Time, a numeric vector giving the number of days since birth when the measurement was made; (3) Chick, an ordered factor giving the unique identifier for each chick (the ordering of the levels groups chicks on the same diet together and orders them by final weight, lightest to heaviest, within diet); and (4) Diet, a factor with levels 1 to 4 indicating which experimental diet the chick received.

Load the data, inspect the first and last 6 rows, and list the variable names:

data(ChickWeight); dat <- ChickWeight
head(dat); tail(dat)
names(dat)
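
The description of the data frame given above can be checked directly. The following is a minimal sketch using base R only; it assumes nothing beyond the datasets package that ships with R:

```r
dat <- datasets::ChickWeight

dim(dat)              # number of rows and number of columns
str(dat)              # class and first few values of each variable
levels(dat$Diet)      # levels of the Diet factor
nlevels(dat$Chick)    # number of distinct chicks
colSums(is.na(dat))   # count of missing values in each column
```

Running str before any analysis is a quick way to confirm that factors have been read as factors and numeric variables as numeric.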

Univariate analyses: continuous data

Numerical summaries:

Counts:

length(dat$weight)

Measures of central tendency: mean, median and mode:

mean(dat$weight)

Mean weight for diet 1:

mean(dat$weight[dat$Diet == 1])

Median weight:

median(dat$weight)

Mode (weight is continuous, so the mode is estimated as the peak of a kernel density estimate):

hist(dat$weight, ylim = c(0, 0.01), prob = TRUE)
dens <- density(dat$weight)
lines(dens)
mode <- round(dens$x[dens$y == max(dens$y)], digits = 0)
mode
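
The same estimate can be written a little more compactly with which.max, which returns the position of the first maximum. This is only a sketch: the name wt_mode is chosen here for illustration (it avoids reusing the name of the base function mode), and the value obtained depends on the bandwidth that density selects by default.

```r
dat <- datasets::ChickWeight

dens <- density(dat$weight)            # kernel density estimate, default bandwidth
wt_mode <- dens$x[which.max(dens$y)]   # x value at which the density is highest
round(wt_mode, digits = 0)
```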

Measures of variability: range, quartiles, variance and standard deviation:

range(dat$weight)
quantile(dat$weight, probs = c(0.25, 0.50, 0.75))
var(dat$weight)
sd(dat$weight)

Coefficient of variation:

sd(dat$weight) / mean(dat$weight)

Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum):

fivenum(dat$weight, na.rm = TRUE)

Descriptive statistics using the epiR package (returns the number of missing values, which some of the other functions don't do):

library(epiR)
epi.descriptives(dat$weight)

The function apply is used to perform row- or column-wise calculations. The first argument specifies the matrix or data frame to work on. Where MARGIN = 1 the function FUN is applied to each row. Where MARGIN = 2 the function FUN is applied to each column. The example below returns the column sums of weight and Time:

apply(dat[,1:2], MARGIN = 2, FUN = sum)
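
The effect of MARGIN is easiest to see on a small matrix. The sketch below uses a made-up 2 x 3 matrix m (not part of the ChickWeight data) purely for illustration:

```r
# matrix() fills column by column, so the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(m, MARGIN = 1, FUN = sum)   # one value per row
col_totals <- apply(m, MARGIN = 2, FUN = sum)   # one value per column

row_totals
col_totals
```

For sums and means specifically, the base functions rowSums, colSums, rowMeans and colMeans give the same results.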

Graphical summaries:

Frequency histogram:

hist(dat$weight, col = "dark blue", border = "gray", xlim = c(0, 400), ylim = c(0, 200), xlab = "Bodyweight (g)", ylab = "Frequency", freq = TRUE, main = "")

Q-Q plot:

qqnorm(dat$weight)
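
On its own, qqnorm draws only the points. A reference line can be added in base R with qqline, which by default passes through the first and third quartiles. A minimal sketch (the colour is an arbitrary choice):

```r
dat <- datasets::ChickWeight

qq <- qqnorm(dat$weight)          # the plotted coordinates are returned invisibly
qqline(dat$weight, col = "red")   # line through the first and third quartiles
```

Points that bend away from the line in the tails suggest a departure from normality.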

The qqPlot function in the car package (called qq.plot in older versions of car) is useful because it adds a reference line and a confidence envelope:

library(car)
qqPlot(dat$weight, distribution = "norm")

Stem and leaf plot:

stem(dat$weight)

Univariate analyses: categorical data

Numerical summaries:

table(dat$Diet)
table(dat$Time)
table(dat$Chick)

table(dat$Diet, dat$Time)
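
Counts are often easier to compare as proportions. The sketch below uses prop.table from base R; the object names diet_counts and diet_props are chosen here for illustration:

```r
dat <- datasets::ChickWeight

diet_counts <- table(dat$Diet)
diet_props <- prop.table(diet_counts)   # proportion of records on each diet
round(diet_props, 3)

# Row proportions: within each diet, the share of records at each time point
round(prop.table(table(dat$Diet, dat$Time), margin = 1), 3)
```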

Mode:

f <- table(dat$Diet)
as.numeric(names(f[max(f) == f]))

Graphical summaries:

mosaicplot(~ Time + Diet, data = dat, color = TRUE)

Bivariate analyses

Numerical summaries:

The by function is useful for producing summary statistics for each level of a grouping factor (stratum). The first argument to the function specifies the variable of interest. The second argument (INDICES) specifies the factor, and the third argument (FUN) defines the function you want to run.

by(dat$weight, INDICES = dat$Diet, FUN = mean)
by(dat$weight, dat$Time, FUN = mean)

by(dat$weight, dat$Diet, FUN = summary)
by(dat$weight, dat$Time, FUN = summary)
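
The group means returned by by can also be obtained with tapply, which returns a named vector, or with aggregate, which returns a data frame. This is a sketch of the equivalent calls rather than a recommendation of one over the other; the object names are illustrative:

```r
dat <- datasets::ChickWeight

# Named vector of mean weights, one element per diet
mean_by_diet <- tapply(dat$weight, INDEX = dat$Diet, FUN = mean)
mean_by_diet

# The same means as a data frame with columns Diet and weight
agg <- aggregate(weight ~ Diet, data = dat, FUN = mean)
agg
```

A data frame is usually the more convenient form if the summaries are to be merged with other data or plotted.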

Graphical summaries:

Box and whisker plots:

par(mfrow = c(1,2), pty = "s")
boxplot(weight ~ Diet, xlab = "Diet", ylab = "Bodyweight (g)", horizontal = FALSE, data = dat)
boxplot(weight ~ Time, xlab = "Time (days)", ylab = "Bodyweight (g)", horizontal = FALSE, data = dat)

Dates

Convert character strings to Date objects, then subtract 183 days from each date:

dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
dates <- as.Date(dates, "%m/%d/%y")
dates - 183
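
Because Date objects are stored as a number of days, two dates can also be subtracted from one another. A minimal sketch using two of the dates above (the names d and gap are illustrative):

```r
d <- as.Date(c("02/27/92", "01/14/92"), format = "%m/%d/%y")

d - 183                # each date shifted back by 183 days
gap <- d[1] - d[2]     # a difftime object, in days
gap
as.numeric(gap)        # the interval as a plain number of days
```

Note that %y reads a two-digit year, so "92" is interpreted as 1992.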

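Format dates as character strings in day-month form (useful for labelling the axes of plots). The result is a character vector rather than a Date, so do any date arithmetic before this step: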

dates <- format(dates, format = "%d-%b")