Non-parametric and semi-parametric regression

Selection of explanatory variables

Load the survival library. Read the addict data file into R:

library(survival)
setwd("D:\\TEMP")
dat <- read.table("addict.csv", header = TRUE, sep = ",")

Set contrasts for clinic and prison. Set the reference category for clinic, making Clinic 1 (base = 1) the reference category:

dat$clinic <- factor(dat$clinic, levels = c(1, 2), labels = c("1", "2"))
contrasts(dat$clinic) <- contr.treatment(2, base = 1, contrasts = TRUE)
levels(dat$clinic)

Same for prison, making absence of a prison record the reference category:

dat$prison <- factor(dat$prison, levels = c(0, 1), labels = c("0", "1"))
contrasts(dat$prison) <- contr.treatment(2, base = 1, contrasts = TRUE)
levels(dat$prison)

Assess the influence of clinic, prison and dose on days to relapse. First of all categorise dose into four classes based on quartiles:

quantile(dat$dose, probs = c(0.25, 0.50, 0.75))
hist(dat$dose)

Quartiles for dose are 50, 60 and 70. Create a categorical variable based on dose:

cdose <- rep(0, time = nrow(dat))
cdose[dat$dose < 50] <- 1
cdose[dat$dose >= 50 & dat$dose < 60] <- 2
cdose[dat$dose >= 60 & dat$dose < 70] <- 3
cdose[dat$dose >= 70] <- 4
dat <- cbind(dat, cdose)

Assess the effect of clinic, prison and cdose on days to relapse:

addict.km01 <- survfit(Surv(stop, status) ~ clinic, type = "kaplan-meier", data = dat)
addict.km02 <- survfit(Surv(stop, status) ~ prison, type = "kaplan-meier", data = dat)
addict.km03 <- survfit(Surv(stop, status) ~ cdose, type = "kaplan-meier", data = dat)

Plot all Kaplan-Meier curves on one page. The mark.time = FALSE argument disables the censor marks:

par(pty = "s", mfrow = c(2,2))
plot(addict.km01, xlab = "Days to relapse", ylab = "Cumulative proportion to experience event", main = "Clinic", lty = c(1,2), mark.time = FALSE)
legend(x = "topright", legend = c("Clinic 1","Clinic 2"), lty = c(1,2), bty = "n", cex = 0.80)
plot(addict.km02, xlab = "Days to relapse", ylab = "Cumulative proportion to experience event", main = "Prison", lty = c(1,2), mark.time = FALSE)
legend(x = "topright", legend = c("Prison absent","Prison present"), lty = c(1,2), bty = "n", cex = 0.80)
plot(addict.km03, xlab = "Days to relapse", ylab = "Cumulative proportion to experience event", main = "Dose categories", lty = c(1,2,3,4), mark.time = FALSE)
legend(x = "topright", legend = c("Dose 1","Dose 2", "Dose 3", "Dose 4"), lty = c(1,2,3,4), bty = "n", cex = 0.80)

Log-rank tests:

survdiff(Surv(stop, status) ~ clinic, data = dat, na.action = na.omit, rho = 0)
survdiff(Surv(stop, status) ~ prison, data = dat, na.action = na.omit, rho = 0)
survdiff(Surv(stop, status) ~ cdose, data = dat, na.action = na.omit, rho = 0)

The variables clinic and dose (as a categorical variable) influence days to relapse. The variable prison is not significant when tested with a log-rank test (P = 0.28), but since it is considered to be biologically important it is retained in our model.

Fit multivariable model

Days to relapse depends on clinic, prison and dose:

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
summary(addict.cph01)

Variables clinic and dose significantly influence time to relapse (P = 2.6E-06 and 3.1E-08, respectively). Variable prison approaching significance (P = 0.06). Drop variable prison (using the update function):

addict.cph02 <- update(addict.cph01, ~. - prison)
summary(addict.cph02)

Now include an interaction term for clinic and prison:

addict.cph03 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose + (clinic * prison), method = "breslow", data = dat)
summary(addict.cph03)

Does addict.cph03 provide a better fit to the data than addict.cph01?

x2 <- -2 * (addict.cph01$loglik[2] - addict.cph03$loglik[2])
1 - pchisq(x2,2)

Inclusion of the interaction terms has no significant effect on model fit (P = 0.17). Now compare observed survival (as estimated by the Kaplan-Meier technique) and predictions from Cox model addict.cph01. Start by creating two data frames. The first for a subject from clinic 1 without a prison record and on a maximum daily methadone dose of 50 mg. The second for a subject from clinic 2 without a prison record and on a maximum daily methadone dose of 50 mg:

addict.km01 <- survfit(Surv(stop, status) ~ clinic, type = "kaplan-meier", data = dat)
dat.clin01 <- data.frame(clinic = factor(c(1)), prison = factor(c(0)), dose = 50)
dat.clin02 <- data.frame(clinic = factor(c(2)), prison = factor(c(0)), dose = 50)

plot(addict.km01, xlab = "Days to relapse", ylab = "Cumulative proportion to experience event", lty = c(1,2), mark.time = FALSE, main = "")
lines(survfit(addict.cph01, newdata = dat.clin01), conf.int = FALSE, mark.time = FALSE, col = "red", lty = 1)
lines(survfit(addict.cph01, newdata = dat.clin02), conf.int = FALSE, mark.time = FALSE, col = "red", lty = 2)
legend(x = "topright", legend = c("Clinic 1: Kaplan Meier", "Clinic 2: Kaplan Meier", "Clinic 1: Cox model", "Clinic 2: Cox model"), lty = c(1,2,1,2), col = c("black", "black", "red", "red"), bty = "n", cex = 0.80)

The Cox model underestimates the pattern of relapse, particularly for patients from Clinic 2. What wording can we use to explain these results?

After adjusting for the effect of presence of a prison record and methadone dose individuals that attended clinic 2 had an instantaneous risk of relapse at any given day that was 0.36 (95% CI 0.24 to 0.55) times the instantaneous risk of those that attended clinic 1. Alternatively, we can say that after adjusting for prison record and methadone dose individuals that attended clinic 2 had a daily log hazard of relapse that was 1.0092 (95% 0.5883 to 1.4300) times lower than those that attended clinic 1.

The first expression (talking about instantaneous risk) is perhaps clearer for non technical readers than the second.

Here we fit a Cox proportional hazards model (with no explanatory variables) to the addict data and extract the baseline (cumulative) hazard function, saving the result as addict.bhc00:

addict.cph00 <- coxph(Surv(stop, status, type = "right") ~ 1, method = "breslow", data = dat)
addict.bhc00 <- basehaz(addict.cph00)

Kaplan-Meier survival function, for comparison:

addict.km00 <- survfit(Surv(stop, status) ~ 1, type = "kaplan-meier", data = dat)

Compare the baseline hazard from the Cox model with the cumulative hazard from the Kaplan-Meier function. The functions are identical, so we add a small amount (0.05) to the baseline hazard to make the comparison clearer:

plot(x = addict.km00$time, y = addict.km00$cumhaz, ylim = c(0,3), xlab = "Days to relapse", ylab = "Cumulative hazard", col = "black", type = "s")
lines(x = addict.bhc00$time, y = addict.bhc00$hazard + 0.05, col = "red", type = "s")
legend(x = "topleft", legend = c("Cumulative hazard (Kaplan-Meier)", "Baseline cumulative hazard (CPH)"), lty = c(1,1), col = c("black", "red"), bty = "n")

Check scale of continuous explanatory variables (method 1)

Replace the continuous covariate dose with design (dummy) variables. Plot the estimated coefficients versus the midpoint of each group:

dat$clinic <- factor(dat$clinic, levels = c(1, 2), labels = c("1", "2"))
contrasts(dat$clinic) <- contr.treatment(2, base = 1, contrasts = TRUE)
dat$prison <- factor(dat$prison, levels = c(0, 1), labels = c("0", "1"))
contrasts(dat$prison) <- contr.treatment(2, base = 1, contrasts = TRUE)

cdose <- rep(0, length(dat[,1]))
cdose[dat$dose < 50] <- 1
cdose[dat$dose >= 50 & dat$dose < 60] <- 2
cdose[dat$dose >= 60 & dat$dose < 70] <- 3
cdose[dat$dose >= 70] <- 4
dat <- cbind(dat, cdose)

dat$cdose <- factor(dat$cdose, labels = c("1", "2", "3", "4"))
contrasts(dat$cdose) <- contr.treatment(4, base = 1, contrasts = TRUE)

addict.cph03 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + cdose, method = "breslow", data = dat)
summary(addict.cph03)

addict.cph03$coefficients
addict.cph03$coefficients[3:5]

x <- c(((50 + min(dat$dose))/2), 55, 65,((max(dat$dose) + 70)/2))
y <- c(0, addict.cph03$coefficients[3:5])
plot(x, y, xlim = c(0, 100), type = "l", xlab = "Dose", ylab = "Regression coefficient")

The scale of the continuous explanatory variable dose is linear --- no transformations required.

Check scale of continuous explanatory variables (method 2)

Plot covariate values versus the Martingale residuals (and their smooth) from a model that excludes the covariate of interest, dose:

addict.cph04 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison, method = "breslow", data = dat)
addict.mg04 <- residuals(addict.cph04, type = "martingale")
plot(dat$dose, addict.mg04, ylim = c(-3,3))
lines(lowess(dat$dose, addict.mg04, iter = 0))
abline(h = 0, lty = 2, col = "gray")

Plot of the covariate values versus the log of the ratio of smoothed censor to smoothed cumulative hazard:

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
addict.mg01 <- residuals(addict.cph01, type = "martingale")
addict.hi01 <- dat$status - addict.mg01
addict.clsm01 <- lowess(dat$dose, dat$status, iter = 0)
addict.hlsm01 <- lowess(dat$dose, addict.hi01, iter = 0)
addict.yi01 <- log(addict.clsm01$y / addict.hlsm01$y) + (addict.cph01$coefficients[3] * dat$dose)
plot(addict.yi01, dat$dose)

Now the two plots together:

par(pty = "s", mfrow = c(1,2))
plot(dat$dose, addict.mg04, ylim = c(-3,3))
lines(lowess(dat$dose, addict.mg04, iter = 0))
abline(h = 0, lty = 2, col = "gray")
plot(addict.yi01, dat$dose)

A linear relationship between the covariate values and each of the calculated parameters is evident, indicating that the continuous variable dose is linear in its log hazard.

Testing the proportional hazards assumption

Proportional hazards assumption test based on Schoenfeld residuals:

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
addict.zph <- cox.zph(addict.cph01)

par(pty = "s", mfrow = c(2,2))
plot(addict.zph[1], main = "Clinic"); abline(h = 0, lty = 2)
plot(addict.zph[2], main = "Prison"); abline(h = 0, lty = 2)
plot(addict.zph[3], main = "Dose"); abline(h = 0, lty = 2)

The variability band for clinic displays a negative slope over time, suggesting non-proportionality of hazards. Formally test the proportional hazards assumption for all variables in addict.cph01:

cox.zph(addict.cph01, global = TRUE)

Using the cox.zph function, rho is the Pearson product-moment correlation between the scaled Schoenfeld residuals and time. The hypothesis of no correlation is tested using test statistic chisq. In the above example, the significant cox.zph test for clinic (P < 0.01) implies that the proportional hazards assumption as been violated for the clinic variable. This notion is supported by the Schoenfeld residual plots. An alternative (and less sensitive) means of testing the proportional hazards assumption is to plot log[-log S(t)] versus time (or the log of time). If the proportional hazards assumption holds the curves should be reasonably parallel. First we plot log[-log S(t)] versus the log of time:

clinic.km <- survfit(Surv(stop, status) ~ clinic, type = "kaplan-meier", data = dat)

plot(clinic.km, fun = "cloglog", lty = c(1,2), xlab = "Days to relapse (log scale)", ylab = "Log cumulative hazard")
legend(x = "topleft", legend = c("Clinic 1", "Clinic 2"), lty = c(1,2), bty = "n")

Now plot log[-log S(t)] versus time:

clinic.km <- survfit(Surv(stop, status) ~ clinic, type = "kaplan-meier", data = dat)
clinic <- c(rep(1, times = clinic.km$strata[1]), rep(2, times = clinic.km$strata[2]))
clinic.haz <- data.frame(clinic, time = clinic.km$time, surv = clinic.km$surv)
clinic1 <- log(-log(clinic.haz$surv[clinic.haz$clinic == 1]))
clinic2 <- log(-log(clinic.haz$surv[clinic.haz$clinic == 2]))

plot(c(clinic.haz$time[clinic.haz$clinic == 1], clinic.haz$time[clinic.haz$clinic == 2]), c(clinic1, clinic2), type = "n", ylim = c(-5, 2), xlab = "Days to relapse", ylab = "Log cumulative hazard", main = "Clinic")
lines(clinic.haz$time[clinic.haz$clinic == 1], clinic1, type = "s", lty = 1)
lines(clinic.haz$time[clinic.haz$clinic == 2], clinic2, type = "s", lty = 2)
legend(x = "topleft", legend = c("Clinic 1", "Clinic 2"), lty = c(1, 2), bty = "n")

We could be talked into concluding that the -log[-log S(t)] vs time plots for clinic are parallel --- conflicting with the findings of the Schoenfield residual test above.

Residuals

Deviance residuals:

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
addict.res <- residuals(addict.cph01, type = "deviance")

par(pty = "s", mfrow = c(2, 2))
boxplot(addict.res ~ dat$clinic, main = "Clinic"); abline(h = 0, lty = 2)
boxplot(addict.res ~ dat$prison, main = "Prison"); abline(h = 0, lty = 2)
plot(dat$dose, addict.res, xlab = "Dose", ylab = "Deviance residual", main = "Dose"); abline(h = 0, lty = 2)

The following plots show the change in each regression coefficient when each observation is removed from the data (influence statistics). The changes plotted are scaled in units of standard errors and changes of less than 0.1 are of little concern. Plot influence statistics (using a common scale for the vertical axis: -0.1 to +0.1):

addict.res <- resid(addict.cph01, type = "dfbeta")
par(mfrow = c(2, 2))
main <- c("Clinic", "Prison", "Dose")
for (i in 1:3){
    plot(1:238, addict.res[,i], type = "h", ylim = c(-0.1,0.1), xlab = "Observation", ylab = "Change in coefficient")
    title(main[i])
  }

The above plots give an idea of the influence individual observations have on the estimated regression coefficients for each covariate. Data sets where the influence plot is tightly clustered around zero indicate an absence of influential observations. Now plot the Martingale residuals:

res <- residuals(addict.cph01, type = "martingale")
X <- dat[,c("clinic", "prison", "dose")]

par(mfrow = c(2,2))
for(j in 1:3){
  plot(X[,j], res, xlab = c("Clinic", "Prison", "Dose")[j], ylab = "Martingale residuals")
  abline(h = 0, lty = 2)
  lines(lowess(X[,j], res))
  }

X$clinic <- as.numeric(levels(X$clinic))[X$clinic]
X$prison <- as.numeric(levels(X$prison))[X$prison]

par(mfrow = c(2,2))
b <- coef(addict.cph01[1:3])
for(j in 1:3){
  plot(X[,j], b[j] * X[,j] + res, xlab = c("Clinic", "Prison", "Dose")[j],
    ylab = "Component + residual")
  abline(lm(b[j] * X[,j] + res ~ X[,j]), lty = 2)
  lines(lowess(X[,j], b[j] * X[,j] + res, iter = 0))
  }

Overall goodness of fit

Cox model:

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
summary(addict.cph01)

Log partial likelihood for the [intercept-only] model and for the fitted model:

addict.cph01$loglik[1]; addict.cph01$loglik[2]

Schemper and Stare (1996) R2:

r.square <- 1 - exp( (2/length(dat[,1])) * (addict.cph01$loglik[1] - addict.cph01$loglik[2]))
r.square

Dealing with violation of the proportional hazards assumption

From the analyses conducted so far, we conclude that the proportional hazards assumption has been violated for the variable clinic. One method of dealing with this is to stratify the model by clinic. This means that we produce a separate baseline hazard function for each level of clinic. Note that by stratifying we cannot obtain a hazard ratio for clinic since the `clinic effect' is absorbed into the baseline hazard.

addict.cph01 <- coxph(Surv(stop, status, type = "right") ~ clinic + prison + dose, method = "breslow", data = dat)
addict.cph04 <- coxph(Surv(stop, status, type = "right") ~ strata(clinic) + prison + dose, method = "breslow", data = dat)
summary(addict.cph04)

Variable Subjects Failed Regression coefficient (SE) P Hazard ratio (95% CI)
Prison:       < 0.01  
   Prison absent 127 81 -   1.00
   Prison present 111 69 0.3760 (0.1689)   1.46 (1.05 - 2.03)
Dose: 238 150 -0.0350 (0.0064) < 0.01 0.97 (0.95 - 0.98)

Compare the clinic + dose + prison model with the stratified model:

x2 <- 2 * (addict.cph04$loglik[2] - addict.cph01$loglik[2])
1 - pchisq(x2, 1)

The stratified model provides a significantly better fit. Parameterising clinic as a time dependent covariate would be one option for dealing with non-proportionality of hazards and retaining the ability to quantify the effect of clinic. Plot Kaplan-Meier survival curves for each clinic, adjusting for the effect of prison and methadone dose:

plot(survfit(addict.cph04), lty = c(1,2), xlab = "Days to relapse", ylab = "Cumulative proportion to experience event")
legend(x = "topright", legend = c("Clinic 1","Clinic 2"), lty = c(1,2), bty = "n")

Poisson regression

A Cox model can be estimated using standard Poisson regression techniques by splitting the data finely and specifying the model as having one rate parameter per time interval. The relevant time-varying explanatory variables should be computed for each interval and fixed explanatory variables should be carried over to all intervals for a given individual. The split data makes a clear distinction between risk time which is the length of each time interval and time scale which is the value of the time scale at (the beginning of) each interval. Using a Poisson approach the log-risk time is used as offset and a smoothed estimate of time is used as a covariate.

Since everything possible using a Cox approach can be done using Poisson modelling of split data, there is no loss of capability by switching to Poisson modelling. When stratification or time-dependent variables are required, the facilities using a standard Cox approach limits the ways in which the desired interactions can be modelled, and can distract the analyst from realising that other interactions between explanatory variables may be of interest.

library(survival)
setwd("D:\\TEMP")
dat <- read.table("addict_poisson.csv", header = TRUE, sep = ",")

dat$clinic <- factor(dat$clinic, levels = c(1, 2), labels = c("1", "2"))
contrasts(dat$clinic) <- contr.treatment(2, base = 1, contrasts = TRUE)
dat$prison <- factor(dat$prison, levels = c(0, 1), labels = c("0", "1"))
contrasts(dat$prison) <- contr.treatment(2, base = 1, contrasts = TRUE)

Code to reformat data

Cox proportional hazards model:

library(Design)
addict.cph <- cph(Surv(stop, status) ~ clinic + prison + dose, surv = TRUE, data = dat)
addict.cph

Poisson model:

addict.glm <- glm(status ~ offset(log(stop - start)) + ns(stop) + factor(clinic) + prison + dose, family = poisson, data = dat)
summary(addict.glm)