R is a statistical analysis system created by Ross Ihaka and Robert Gentleman at the University of Auckland in 1996. R is both a language and a software environment. The language is a dialect of the S language developed at AT&T Bell Laboratories. R is freely distributed under the terms of the GNU General Public License of the Free Software Foundation; its development and distribution are carried out by a group of statisticians known as the R Development Core Team. A key element in this development is the Comprehensive R Archive Network (CRAN).
R is a language with many functions for statistical analyses and graphics; the latter are displayed immediately in their own window and can be saved in various formats (for example, *.jpg and *.png under Windows). The results of a statistical analysis can be displayed on the screen, and intermediate results (P-values, regression coefficients) can be written to a file or used in subsequent analyses. The R language allows the user, for instance, to program loops of commands to successively analyse several data sets. It is also possible to combine different statistical functions in a single program to perform complex analyses.
Reference: Ihaka R, Gentleman R (1996). R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5: 299-314.
Once R is installed on your computer, the software is accessed by launching the corresponding executable (RGui.exe under Windows, R under Unix). The prompt > indicates that R is waiting for your commands (Figure 1).
![Figure 1: The R graphical user interface.](../../images/Rgui.png)
Under Windows, some commands related to the system (accessing the on-line help, opening files and so on) can be executed via pull-down menus, but most have to be typed into the console using the keyboard.
R is an object-oriented language: the variables, data, matrices, functions, results, etc. are stored in the active memory of the computer as objects, each of which has a name. To display the content of an object, simply type its name. For example, we can assign the value 15 to an object named n:
n <- 15
n
When we type n, the value [1] 15 is returned. The digit 1 within square brackets indicates that the display starts at the first element of n. The assignment symbol (<-) is used to give a value to an object. It is written as an angle bracket (< or >) together with a minus sign, forming a small arrow that can point from right to left or from left to right:
5 -> n
n
The value given to a named object can be the result of an arithmetic expression:
n <- 10 + 2
n
You can simply type an expression without assigning its value to an object; the result is displayed but not stored in memory:
(10 + 2) * 5
The tasks you carry out with your data in R are performed using functions. Think of a function as a special kind of object that can perform an action for you, producing output in response to the input you provide. When we want a function to do something, we call it. For example:
Generate 100 numbers at random between 0 and 1 using the runif (random uniform) function:
runif(n = 100, min = 0, max = 1)
Functions can be distinguished from other objects in R (such as character strings, vectors, matrices and so on) by the parentheses at the end of their names. Most functions accept one or more named arguments. In the example above, the arguments for the function runif are n (the number of elements to be generated), min (the minimum bound of the distribution) and max (the maximum bound of the distribution).
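Arguments can be matched by name or by position; named arguments may be given in any order. A quick sketch (the object names a and b are used for illustration only):

```r
# Positional matching: arguments supplied in the order runif() defines them (n, min, max)
a <- runif(100, 0, 1)

# Named matching: named arguments can be given in any order
b <- runif(max = 1, min = 0, n = 100)

length(a)   # 100 random numbers, all between 0 and 1
```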
Functions take inputs via their arguments, do something with those arguments, and return outputs. For function outputs we have two options: we can return them directly to the console (as in the runif example above), or we can give them a name. If you assign the output of a function to a name, type that name into the R console to view it:
myran <- runif(n = 100, min = 0, max = 1)
myran
You can create your own functions. For example:
myMean <- function(x) {
  n.x <- length(x)
  sum.x <- sum(x)
  rval <- sum.x / n.x
  rval
}
Test the function:
x <- c(2,4,6,8,10,12)
myMean(x)
Now save the function as a text file with a *.R extension (e.g. myMean.R). To load the function during a future R session, use the source() function:
source("D:\\TEMP\\myMean.R")
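As a sketch of how a user-defined function can be extended, myMean might be given an na.rm argument (modelled on the one used by R's own mean() function) so that missing values can be dropped before the calculation. The na.rm argument is our own addition here, not part of the function defined above:

```r
myMean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # drop missing values if requested
  sum(x) / length(x)
}

myMean(c(2, 4, 6, 8, 10, 12))        # 7
myMean(c(2, 4, NA), na.rm = TRUE)    # 3
```

Without na.rm = TRUE, a vector containing NA returns NA, because sum() propagates missing values.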
Families of functions can be bundled together in a package (sometimes called a library). Packages are downloaded from the Comprehensive R Archive Network (CRAN) and installed on your computer. When you start R, you 'load' the packages you require into your R session. The packages that you load into each R session will depend on the analysis you are working on. For example, for epidemiological analyses we might use functions available in the epiR package:
library(epiR)
And do a basic 2 × 2 table analysis:
dat <- matrix(c(13,2163,5,3349), nrow = 2, byrow = TRUE)
rownames(dat) <- c("DF+", "DF-"); colnames(dat) <- c("FUS+", "FUS-"); dat
epi.2by2(dat = as.table(dat), method = "cross.sectional", conf.level = 0.95, units = 100, outcome = "as.columns")
So once a package is installed, it can be used at any time (i.e. you don't have to download the package again). Be aware that R packages are frequently updated, so it is wise to periodically check CRAN for package updates.
Loops provide a means for repetitively carrying out a sequence of calculations. Say we'd like to determine an appropriate number of subjects required for a trial investigating the effect of a treatment on Planned Start of Mating (PSM) to conception intervals. We estimate that at 40 days after PSM 50% of the treated group and 40% of the control group will not have conceived. We'd like to make a plot of the required number of study subjects as a function of the study power. Create a vector of power estimates:
library(epiR); library(tidyr); library(dplyr)
power <- seq(from = 0.20, to = 0.90, by = 0.01); n <- c()
for(i in 1:length(power)){
  tn <- epi.survivalsize(treat = 0.50, control = 0.40, n = NA, power = power[i], r = 1, design = 1, sided.test = 2, conf.level = 0.95)
  n <- c(n, tn$n.total)
}
rval <- data.frame(power, n)
head(rval)
plot(x = rval$power, y = rval$n, xlim = c(0,1), ylim = c(0, 1000), xlab = "Power", ylab = "Total number of study subjects", type = "p", pch = 16)
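The growing-vector pattern above (n <- c(n, ...)) works, but sapply() is a more idiomatic way to collect one result per element of power. A sketch with a simple stand-in function, since epi.survivalsize() needs the epiR package; f and res are hypothetical names used for illustration only:

```r
# Stand-in for the sample size calculation: any function returning one number per power value
f <- function(p) ceiling(100 / p)

power <- seq(from = 0.20, to = 0.90, by = 0.01)
n <- sapply(power, f)        # one call per element of power, no growing vector
res <- data.frame(power, n)
head(res)
```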
Next, we might wish to produce a matrix of required sample sizes for different levels of power and levels of treatment effect. Rather than running the above function call 25 times, once for each power-treatment combination (very tedious), we create a loop statement:
dlong <- expand.grid(power = c(0.50, 0.60, 0.70, 0.80, 0.90), treat = c(0.41, 0.42, 0.43, 0.44, 0.45), n = NA)
for(i in 1:nrow(dlong)){
  tn <- epi.survivalsize(treat = dlong$treat[i], control = 0.40, n = NA, power = dlong$power[i], r = 1, design = 1, sided.test = 2, conf.level = 0.95)
  dlong$n[i] <- tn$n.total
}
Convert the results data frame from long format to wide:
dwide <- dlong %>%
spread(key = treat, value = n)
dwide
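Note that in current versions of tidyr, spread() has been superseded by pivot_wider(). A minimal sketch of the equivalent call, using a small made-up long-format data frame:

```r
library(tidyr)

# A small long-format data frame for illustration
small <- expand.grid(power = c(0.50, 0.60), treat = c(0.41, 0.42))
small$n <- 1:4

# names_from gives the column whose values become new column names;
# values_from gives the column that fills the new cells
wide <- pivot_wider(small, names_from = treat, values_from = n)
wide
```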
R works with objects, which all have two intrinsic attributes: mode and length. The mode describes the kind of elements an object contains. There are four data modes: numeric, character, complex, and logical. Modes that do not characterise data include function, expression, and formula. Length is the total number of elements in an object. The following table summarises the different objects manipulated by R.
| Object | Possible modes | Several modes? |
| --- | --- | --- |
| vector | numeric, character, complex, or logical | No |
| factor | numeric, or character | No |
| array | numeric, character, complex, or logical | No |
| matrix | numeric, character, complex, or logical | No |
| data.frame | numeric, character, complex, or logical | Yes |
| list | numeric, character, complex, logical, function, expression, or formula | Yes |
| ts | numeric, character, complex, or logical | Yes |
A vector is a single column of values. A factor is a categorical variable. An array is a table with k dimensions, a matrix being the particular case of an array with k = 2. Note that the elements of an array or matrix are all of the same mode. A data.frame is a table composed of several vectors, all of the same length but possibly of different modes. A ts (time-series) data set contains supplementary attributes for time-series analyses, such as frequency and dates.
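The mode and length of an object can be queried with the mode() and length() functions:

```r
x <- c(2, 4, 6)
mode(x)     # "numeric"
length(x)   # 3

y <- c("a", "b")
mode(y)     # "character"

z <- c(TRUE, FALSE)
mode(z)     # "logical"
```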
Among the non-intrinsic attributes of an object, dim is important. dim corresponds to the dimensions of a multivariate object. For example, a matrix with 2 rows and 2 columns has dim equal to the pair of values [2, 2], but its length is 4.
Continuing the 2 × 2 table example above, we first generated a matrix containing the data for the 2 × 2 table:
dat <- matrix(c(13,2163,5,3349), nrow = 2, byrow = TRUE)
What are the dimensions of the object dat?
dim(dat)
The dimensions of the object dat are [1] 2 2: two rows and two columns.
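As noted above, the length of a matrix is the total number of its elements, which is distinct from its dim:

```r
dat <- matrix(c(13, 2163, 5, 3349), nrow = 2, byrow = TRUE)
dim(dat)     # 2 2 -- two rows, two columns
length(dat)  # 4  -- total number of elements
```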
R distinguishes upper-case from lower-case characters in the names of objects (i.e. it is case-sensitive), so x and X can be used to name distinct objects (even under Windows):
x <- 1; X <- 10
List the objects in your R environment:
ls()
In this exercise we will use R to read a dataset and produce some descriptive statistics, produce some charts, and perform some simple statistical inference. The aim of the exercise is for you to become familiar with R and some basic R functions and objects.
The data set we will use for this exercise is called epi.fem.csv. The data consist of 118 records of nine variables for female psychiatric patients. Copy this file to your working directory (e.g. D:\TEMP). Start R, set your working directory, and read in the epi.fem.csv file, giving it the name fem:
setwd("D:\\TEMP")
fem <- read.table(file = "epi.fem.csv", header = TRUE, sep = ",")
You can inspect the contents of the fem data frame (or any other R object) just by typing its name:
fem
Note that the fem object is built from other objects. You can view the first six rows of the data frame using the function head:
head(fem)
The data consist of 118 records of nine variables for female psychiatric patients. The first six records of the fem data frame are:
| id | age | iq | anxiety | depress | sleep | sex | life | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 39 | 94 | 2 | 2 | 2 | 2 | 2 | 2.23 |
| 2 | 41 | 89 | 2 | 2 | 2 | 2 | 2 | 1.00 |
| 3 | 42 | 83 | 3 | 3 | 3 | 2 | 2 | 1.82 |
| 4 | 30 | 99 | 2 | 2 | 2 | 2 | 2 | -1.18 |
| 5 | 35 | 94 | 2 | 1 | 1 | 2 | 1 | -0.14 |
| 6 | 44 | 90 | NA | 1 | 2 | 1 | 1 | 0.41 |
id: Patient identifier.
age: Age in years.
iq: Intelligence quotient.
anxiety: Anxiety (1 = none, 2 = mild, 3 = moderate, 4 = severe).
depress: Depression (1 = none, 2 = mild, 3 = moderate or severe).
sleep: Sleeping normally (1 = yes, 2 = no).
sex: Lost interest in sex (1 = yes, 2 = no).
life: Considered suicide (1 = yes, 2 = no).
weight: Weight change (kg) in previous 6 months.
If you want to see more than the first six rows, specify the number of rows to view directly. Say we want to view rows 1 to 10:
fem[1:10,]
You can access the contents of a single column by name:
fem$iq
fem$iq[1:10]
The $ sign is used to separate the name of the data frame and the name of the column of interest. You can also access rows, columns, and individual cells by specifying row and column positions. For example, the iq column is the third column in the fem data frame:
fem[,3]
fem[9,]
fem[9,3]
There are missing values in the iq column which are all coded as -99. Before proceeding we must set these to the special NA (missing value) designator:
fem$iq[fem$iq == -99] <- NA
The term inside the square brackets is called an index and is used to refer to subsets of data held in an object. In this situation we are instructing R to set the contents of the iq variable to NA if the contents of the iq variable is -99. Check that this has worked:
fem$iq
We might now like to make a subset of the data, selecting all variables for those patients where anxiety is missing (NA) and depress == 2, and writing the results to a new data frame called temp. There are two methods for doing this:
Method 1:
temp <- subset(x = fem, subset = is.na(anxiety) & depress == 2)
Method 2:
id <- is.na(fem$anxiety) & fem$depress == 2
temp <- fem[id,]
id is a vector of TRUE/FALSE flags, one for each patient in fem: TRUE where the selection criteria are met, FALSE otherwise. temp <- fem[id,] extracts all columns of the data frame fem for the rows where id is TRUE, assigning the result the name temp.
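The same logical-index idea in miniature, with a simple vector instead of a data frame:

```r
ages <- c(39, 41, 42, 30)
id <- ages > 40     # a TRUE/FALSE flag for each element
id                  # FALSE TRUE TRUE FALSE
ages[id]            # 41 42 -- only the elements where id is TRUE
```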
Descriptive statistics for all variables in the data frame:
summary(fem)
It is probably easier to appreciate the nature of the data if we use histograms. First, we consider the continuous variables age and iq:
par(mfrow = c(2,2))
hist(fem$age)
hist(fem$iq)
Now plot the categorical variables:
par(mfrow = c(2,2))
hist(fem$anxiety, breaks = c(1,2,3,4))
hist(fem$depress)
hist(fem$sleep)
hist(fem$sex)
par(mfrow = c(2,2))
hist(fem$life)
hist(fem$weight)
We can now compare the groups who have and have not considered suicide by tabulating summary statistics of the other columns for the two groups. For example, for iq:
by(data = fem$iq, INDICES = fem$life, FUN = summary)
The by() function applies another function (in this case the summary() function) to a column in a data frame (in this case fem$iq), split by the value of another variable (in this case fem$life). It can be tedious to have to specify the data frame each time we want to use a particular variable. We can fix this problem by 'attaching' it to our R session:
attach(fem)
We can now refer to the columns in the fem data frame without having to specify the name of the data frame itself. This time we will produce summary statistics for weight by life:
by(data = weight, INDICES = life, FUN = summary)
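An alternative to attach() is with(), which evaluates an expression inside a data frame without leaving it attached (attach() can cause confusion if objects in the workspace share names with columns in the data frame). A stand-alone sketch with a small made-up data frame, so that it does not depend on fem:

```r
d <- data.frame(g = c(1, 1, 2, 2), y = c(3, 5, 7, 9))

# Equivalent to attach(d); by(data = y, INDICES = g, FUN = summary); detach(d)
with(d, by(data = y, INDICES = g, FUN = summary))
```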
View the same data as a box and whisker plot:
boxplot(weight ~ life)
Add axis labels and tidy the graph up:
boxplot(weight ~ life, boxwex = 0.25, ylim = c(-4, 4), col = "blue", xlab = "Life", ylab = "Weight", main = "Weight change by considered suicide")
The groups do not seem to differ much in their medians and the distributions appear to be reasonably symmetrical about their medians with a similar spread of values. We can look at the distributions as histograms, firstly 'without any frills':
par(mfrow = c(2,2))
hist(weight[life == 1])
hist(weight[life == 2])
Tidy the plots up, using common x and y axes (so the distributions are easier to compare):
par(mfrow = c(2,2))
hist(weight[life == 1], breaks = seq(-4, 4, by = 1), xlim = c(-4, 4), ylim = c(0, 20), col = "blue", xlab = "Weight gain (kg)", main = "Suicide: yes")
hist(weight[life == 2], breaks = seq(-4, 4, by = 1), xlim = c(-4, 4), ylim = c(0, 20), col = "blue", xlab = "Weight gain (kg)", main = "Suicide: no")
Check the assumption of normality using quantile-quantile plots:
par(mfrow = c(2,2))
qqnorm(weight[life == 1], xlim = c(-2, 2), ylim = c(-4, 4), main = "Suicide: yes")
qqline(weight[life == 1])
qqnorm(weight[life == 2], xlim = c(-2, 2), ylim = c(-4, 4), main = "Suicide: no")
qqline(weight[life == 2])
Plot the histograms and the quantile-quantile plots together and save the output as a *.png file (that might be included in a report):
par(mfrow = c(2,2))
hist(weight[life == 1], breaks = seq(-4, 4, by = 1), xlim = c(-4, 4), ylim = c(0, 20), col = "blue", xlab = "Weight gain (kg)", main = "Suicide: yes")
hist(weight[life == 2], breaks = seq(-4, 4, by = 1), xlim = c(-4, 4), ylim = c(0, 20), col = "blue", xlab = "Weight gain (kg)", main = "Suicide: no")
qqnorm(weight[life == 1], xlim = c(-2, 2), ylim = c(-4, 4), main = "Suicide: yes")
qqline(weight[life == 1])
qqnorm(weight[life == 2], xlim = c(-2, 2), ylim = c(-4, 4), main = "Suicide: no")
qqline(weight[life == 2])
savePlot(filename = "mygraph", type = "png", device = dev.cur())
Run a formal test for normality using the Shapiro-Wilk test (here, the null hypothesis is that the data are normally distributed: small P-values indicate a lack of normality):
shapiro.test(x = weight[life == 1])
shapiro.test(x = weight[life == 2])
Remember that we can use the by() function to apply a function to a data frame, including statistical functions such as shapiro.test():
by(data = weight, INDICES = life, FUN = shapiro.test)
We can also test whether the variances differ significantly using Bartlett's test for the homogeneity of variances:
bartlett.test(x = weight, g = life)
There is no significant difference between the two variances. Having checked the normality and homogeneity of variance assumptions we can proceed to carry out a t-test:
t.test(x = weight[life == 1], y = weight[life == 2], var.equal = TRUE)
There is no evidence that the two groups differ in their weight change in the previous six months. Note that we could still have performed a t-test if the variances were not homogeneous by setting var.equal = FALSE:
t.test(x = weight[life == 1], y = weight[life == 2], var.equal = FALSE)
We can explore the correlation between two variables using the cor() function:
cor(x = iq, y = weight, use = "pairwise.complete.obs")
or by using a scatter plot:
plot(x = iq, y = weight)
and by a formal test:
cor.test(x = iq, y = weight)
With some functions you can pass a whole data frame rather than a list of variables:
cor(x = fem, use = "pairwise.complete.obs")
pairs(fem)
The output can be a little confusing particularly if it includes categorical or record identifying variables. To get round this we can create a new object that contains only the columns we are interested in using the column binding cbind() function:
newfem <- cbind(age, iq, weight)
cor(x = newfem, use = "pairwise.complete.obs")
pairs(x = newfem)
When we have finished with the newfem object we can remove it from the R environment:
rm(newfem)
Note that there was no real need to create the newfem object as we could have fed the output of the cbind() function directly to the cor() or pairs() functions:
cor(x = cbind(age, iq, weight), use = "pairwise.complete.obs")
pairs(x = cbind(age, iq, weight))
It is probably easier to work with the newfem object rather than having to retype the cbind() function. This is particularly true if you wanted to continue with an analysis of just the three variables. The relationship between age and weight can be plotted using the plot() function:
plot(x = age, y = weight)
And tested using the cor() and cor.test() functions:
cor(x = age, y = weight, use = "pairwise.complete.obs")
cor.test(x = age, y = weight)
Or by using a linear regression model with the lm() function:
summary(lm(weight ~ age))
We use the summary() function here to extract summary information from the output of the lm() function. It can be useful to use lm() to create an object:
fem.lm01 <- lm(weight ~ age)
And use the output in other functions:
summary(fem.lm01)
plot(x = age, y = weight)
abline(fem.lm01)
In this case we are passing the intercept and slope information held in the fem.lm01 object to the abline() function, which draws the regression line onto the scatterplot.
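The intercept and slope held in a fitted lm object can also be extracted explicitly with coef(). A self-contained sketch with made-up data (d2 and fit are illustrative names):

```r
# Made-up data where the true relationship is y = 1 + 2x exactly
d2 <- data.frame(x = 1:10, y = 2 * (1:10) + 1)
fit <- lm(y ~ x, data = d2)
coef(fit)   # intercept 1, slope 2
```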
A useful function to apply to the fem.lm01 object is plot(), which produces diagnostic plots of the linear model:
plot(fem.lm01)
Before we finish we should save the fem data frame so that next time we want to use it we will not have to bother with recoding the missing values to the special NA value. This is done with the write.table() function:
write.table(x = fem, file = "newfem.csv", row.names = FALSE, sep = ",")
Everything in R is either a function or an object --- even the command to quit R is a function:
q()
When you call the q() function you will be asked if you want to save the workspace image. If you save the workspace image then all the objects and functions currently available to you (in this case the fem data frame and the fem.lm01 linear model) will be saved in an *.RData file. These will then be automatically restored the next time you start R in the current working directory. For this exercise there is no need to save the workspace image so click 'No'.
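Objects can also be saved and restored explicitly, rather than via the workspace prompt, using save() and load(). A sketch using a temporary file (the names n2 and f2 are illustrative):

```r
n2 <- 15
f2 <- tempfile(fileext = ".RData")
save(n2, file = f2)   # write the object to disk
rm(n2)                # remove it from the workspace
load(f2)              # restore it from the file
n2                    # 15
```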
The on-line help of R provides useful information on how to use the functions. The help in HTML format is called by typing:
help.start()
A keyword search is possible with this HTML help. A search for functions can be done with apropos("what"), which lists the functions with 'what' in their name:
apropos("anova")
The help is available in text format for a given function, for instance:
?lm
displays the help file for the function lm().
The function ls() lists the objects in memory: only the names of the objects are displayed.
name <- "Terry"; n1 <- 10; n2 <- 100; m <- 0.5
ls()
Yields [1] "m" "n1" "n2" "name". Note the use of the semi-colon ';' to separate distinct commands on the same line. If there are a lot of objects in memory it may be useful to list only those which contain a given character in their name: this can be done with the option pattern (which can be abbreviated to pat):
ls(pat = "m")
Yields [1] "m" "name". If we want to restrict the list to objects whose names start with this character:
ls(pat = "^m")
Yields [1] "m". To display the details on the objects in memory, we can use the function ls.str():
ls.str()
The option pattern can also be used as described above. Another useful option of ls.str() is max.level, which specifies the level of detail displayed for composite objects. By default, ls.str() displays the details of all objects in memory, including the columns of data frames, matrices and lists, which can result in a very long display. One can avoid displaying all these details with the option max.level = -1:
M <- data.frame(n1, n2, m)
ls.str(pat = "M", max.level = -1)
To delete objects in memory, we use the function rm(): rm(x) deletes the object x, rm(x,y) deletes both x and y objects; rm(list = ls()) deletes all objects in memory. The same options mentioned for the function ls() can then be used to selectively delete objects:
rm(list = ls(pat = "m"))
![Plotting symbols in R and their numeric codes.](../../images/R_pch_symbols.png)