The ordinary Cox model treats each outcome event as being independent. When the survival likelihood for subjects is not independent (say subjects are grouped into herds, regions, households) the assumption of independence may not hold. Failure to account for this lack of independence means that the variance of computed regression coefficients will be underestimated, making us more likely to make a Type I error (rejecting the null hypothesis when, in fact, it is really true). Two techniques may be applied to account for non-independent (i.e. correlated) data: (1) robust sandwhich estimators, and (2) frailty terms.
With robust variance estimates the observed data is resampled (with replacement) to achieve a sample of the same size each time, and to use the variation in the estimated parameters across the set of samples to obtain a value for the sampling variability of the estimates. With correlated data the sample needs to be drawn with replacement from the set of independent subjects (not observations) so that intra-subject correlation is preserved in the samples that are taken.
In R, the cluster argument within the cph function is used to compute a robust variance for a Cox proportional hazard regression model.
This example is taken from a paper by Wei, Lin and Weissfeld (1989). The study is of time to recurrence of bladder cancer. The data frame bladder2 has multiple rows for each subject. Many subjects had recurrences of bladder cancer, sometimes as many as four, and were followed beyond the fourth recurrence. Reference: Wei LJ, Lin DY, Weissfeld L (1989) Regression analysis of multivariate incomplete failure time data by modelling marginal distributions. Journal of the American Statistical Association. 84:1065-1073.
| id | rx | size | number | start | stop | event | enum |
| 1 | 1 | 3 | 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 1 | 2 | 0 | 4 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 | 7 | 0 | 1 |
| 4 | 1 | 1 | 5 | 0 | 10 | 0 | 1 |
| 5 | 1 | 1 | 4 | 0 | 6 | 1 | 1 |
| 5 | 1 | 1 | 4 | 6 | 10 | 0 | 2 |
id: patient identifier.
rx: 1 = placebo, 2 = thiopet.
size: size of the largest initial tumour.
number: number of initial tumours.
start: entry into the study or the time of last recurrence.
stop: time to event (months).
event: 0 = censored, 1 = event.
enum: event number.
library(survival)
setwd("D:\\TEMP")
dat <- read.table("bladder2.csv", header = TRUE, sep = ",")
bladder.cph01 <- coxph(Surv(start, stop, event) ~ rx + size + number, data = dat)
summary(bladder.cph01)
bladder.cph02 <- coxph(Surv(start, stop, event) ~ rx + size + number + cluster(id), data = dat)
summary(bladder.cph02)
| Variable | Subjects | Failed | Regression coefficient (SE) | P | Hazard ratio (95% CI) |
| Treatment | 190 | 112 | -0.4116 (0.1999) | 0.04 | 0.66 (0.45 - 0.98) |
| Size | 190 | 112 | -0.0411 (0.0703) | 0.56 | 0.96 (0.84 - 1.10) |
| Number | 190 | 112 | 0.1637 (0.0478) | < 0.01 | 1.18 (1.07 - 1.29) |
R2 = 0.074.
Likelihood ratio test = 14.7 on 3 df, P < 0.01.
| Variable | Subjects | Failed | Regression coefficient (SE) | P | Hazard ratio (95% CI) |
| Treatment | 190 | 112 | -0.4116 (0.2488) | 0.10 | 0.66 (0.41 - 1.08) |
| Size | 190 | 112 | -0.0411 (0.0742) | 0.58 | 0.96 (0.83 - 1.11) |
| Number | 190 | 112 | 0.1637 (0.0584) | < 0.01 | 1.18 (1.05 - 1.32) |
R2 = 0.074.
Likelihood ratio test = 14.7 on 3 df, P < 0.01.
In recent times there has been active research concerning the addition of random effects to survival models. In this setting, a random effect is a continuous variable that describes excess risk or frailty for distinct categories such as individuals, families or herds. The idea is that individuals have different frailties, and those who are most `frail' will experience failure earlier than others. The inclusion of the frailty term in a Cox model allows for the possible correlation between the recurrence times of an individual. The model that could be fitted is as follows:
h(t,X) = h0(t). exp (b1agei(j) + b2sexi(j) + uj)
Above, the term uj represents the frailty term for the jth cluster. Frailty terms can be specified as following a normal or gamma distribution. Under the normal assumption, frailty estimates are symmetric around zero. Under the gamma assumption, frailty estimates are asymmetric (allowing for groups displaying exceptionally low or high risk --- as in the case of genetic disease studies where the presence of a high-risk allele markedly increases the hazard of failure).
As an illustration, consider the data in the lung.csv file. These data relate to mortality among advanced lung cancer patients, conducted by the North Central Cancer Treatment Group. The subset used here contains details for 228 patients. Reference: Loprinzi CL, Laurie JA et al. (1994) Prospective evaluation of prognostic variables from patient-completed questionnaires. J. Clinical Oncol. 12:601-607.
| inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
| 3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | . |
| 3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
| 3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | . | 15 |
inst: enrolling institution.
time: day of event.
status: 1 = alive, 2 = dead.
age: patient age at enrolment.
sex: 1 = male, 2 = female.
ph.ecog: physician's estimate of ECOG performance score (0 to 4).
ph.karno: physician's estimate of Karnofsky score (an alternative to ECOG performance score). Values are 20, 30, ... , 100.
pat.karno: patient's estimate of Karnofsky score.
meal.cal: calories consumed at meals (excluding beverages and snacks).
wt.loss: weight loss in the last six months (a negative number implies weight gain).
There are 18 separate institutions that enrolled at least one subject in this trial. Because the enrolling institutions range from community practices to a large tertiary care centre, differences in the baseline risk of enrollees might be a concern. A fixed-effects Cox proportional hazards model (i.e. one that ignores intra-institutional correlation) would be called as follows:
library(survival)
setwd("D:\\TEMP")
dat <- read.table("lung.csv", header = TRUE, sep = ",")
lung.cph01 <- coxph(Surv(time, status) ~ sex + ph.karno + pat.karno, data = dat)
summary(lung.cph01)
Include enrolling institution as fixed effect:
dat$inst <- factor(dat$inst)
contrasts(dat$inst) <- contr.treatment(18, base = 1, contrasts = TRUE)
lung.cph02 <- coxph(Surv(time, status) ~ sex + ph.karno + pat.karno + inst, data = dat)
summary(lung.cph02)
Treating enrolling institution as a categorical variable (that is, a factor) is a satisfactory approach when there is only a small number of them. Presentation of the model becomes clumsy when there are large numbers (as in the example above). An alternative is to treat the variable inst as a frailty term:
lung.cph03 <- coxph(Surv(time, status) ~ sex + ph.karno + pat.karno + frailty(inst), data = dat)
summary(lung.cph03)
The variance of the frailty term is 0.00149. The p-value for the frailty term is 0.36. We conclude that the frailty term is not significant, that is the variance of the frailty term does not differ significantly from zero.
| Variable | Subjects | Failed | Regression coefficient (SE) | P | Hazard ratio (95% CI) |
| Sex | 228 | 165 | -0.5118 (0.1693) | < 0.01 | 0.60 (0.43 - 0.83) |
| Physician karno | 228 | 165 | -0.0062 (0.0068) | 0.37 | 0.99 (0.98 - 1.01) |
| Patient karno | 228 | 165 | -0.0170 (0.0065) | < 0.01 | 0.98 (0.97 - 1.00) |
R2 = 0.097.
Likelihood ratio test = 22.9 on 2 df, P < 0.01.
| Variable | Subjects | Failed | Regression coefficient (SE) | P | Hazard ratio (95% CI) |
| Sex | 228 | 165 | -0.5197 (0.1763) | < 0.01 | 0.59 (0.42 - 0.84) |
| Physician karno | 228 | 165 | -0.0108 (0.0073) | 0.14 | 0.99 (0.97 - 1.00) |
| Patient karno | 228 | 165 | -0.0190 (0.0069) | < 0.01 | 0.98 (0.97 - 0.99) |
| Institution: | |||||
| Inst 2 | 5 | 4 | 0.5159 (0.5427) | 0.34 | 1.67 (0.58 - 4.85) |
| Inst 3 | 19 | 15 | -0.4168 (0.3315) | 0.21 | 0.66 (0.34 - 1.26) |
| ... | |||||
| Inst 18 | 2 | 1 | -2.9509 (9.2902) | 0.75 | 0.05 (0.00 - 4.23E+06) |
R2 = 0.173.
Likelihood ratio test = 42.4 on 20 df, P < 0.01.
| Variable | Subjects | Failed | Regression coefficient (SE) | P | Hazard ratio (95% CI) |
| Sex | 228 | 165 | -0.5098 (0.1695) | < 0.01 | 0.60 (0.43 - 0.84) |
| Physician karno | 228 | 165 | -0.0062 (0.0068) | 0.37 | 0.99 (0.98 - 1.00) |
| Patient karno | 228 | 165 | -0.0170 (0.0065) | < 0.01 | 0.98 (0.97 - 1.00) |
R2 = 0.099.
Variance of random effect: 0.00149.
Likelihood ratio test = 23.2 on 3.21 df, P < 0.01.
A ranked error bar plot of the frailty terms for each institution allow us to better-appreciate those institutiions that are prone to `earlier failures' than others, as shown in Figure 1.
![\TeX \begin{figure}[h]
\centering
\includegraphics[width=10cm]{../images/frailty.png}
\caption{Influence of institution on the daily hazard of death for patients enrolled in a study of advanced lung cancer patients, conducted by the North Central Cancer Treatment Group. Institution identifiers are shown at the top of the plot. The effect of institution on hazard of death is subtle, with institutions 15, 11, and 3 demonstrating the lowest daily hazard of failure. Confidence intervals are wide, demonstrating no clear evidence that one institution is associated with lower or higher hazards of failure over another.}
\label{fig:frailty}
\end{figure}](../images/frailty.png)