PROGRAM FOR COX MODEL ANALYSIS WITH MISCLASSIFIED DISCRETE COVARIATES
Reference: Zucker, D.M. and Spiegelman, D. (2004). Inference for the proportional
hazards model with misclassified discrete-valued covariates. Biometrics 60:324-334.
This note describes how to run a Fortran program that implements the method
presented in the above-cited reference. This is a method for Cox regression
analysis with discrete covariates subject to possible misclassification.
The setup is as follows. There are "nb" discrete covariates, subject to possible
misclassification. There are "k" possible configurations of the covariate vector
(we include only those configurations that actually appear in the data set).
What is observed is not the true covariate vector, but an error-prone surrogate.
It is assumed that the surrogate has the same range of levels as the true covariate.
We have a matrix "A" that describes the classification probabilities: A(i,j) is the
probability that the true configuration is the j-th one given that the observed
configuration is the i-th one. The matrix "A" typically is a function of some
number "nq" of parameters, call them omega_l. If the matrix "A" is assumed to
be known without error, then "nq" is taken equal to 0.
There are two versions of the program. The basic program consists of two
Fortran files: a driver (donrunA.f) and a subroutine file (donsubsA.f). The
extended program consists of two similar files, except with "B" instead of "A"
in the filenames. The extended program covers an extension to left truncation
(which is not discussed in the original paper). The extended program runs
somewhat more slowly than the basic program.
To run the program, it is necessary to prepare two files, the main data file
and a "control card" file. The main file must be called input.dat. The control
card file must be called cntlr.crd. The results are output to donna.out.
The format of the main data file is as follows. There is one record for each
subject. Each record consists of the following elements.
1. The covariate configuration (stratum) associated with the subject (this will
be an integer from 1 to "k").
2. A 0-1 indicator variable for whether the subject had an event (1) or was censored (0).
3. The time (e.g. age) at which the subject entered the cohort. This is to accommodate
left truncation, which is common in studies where age is the time variable. If there
is no left truncation at all, then this variable should be set equal to 0 for all
subjects. (If the basic program is being run rather than the extended program, this
column should be OMITTED.)
4. The time at which the subject left the cohort (either due to an event or to censoring).
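The record layout above can be sketched in Python as follows. This is only an illustration: the helper name format_record, the sample subjects, and the choice of four decimal places are hypothetical, and the sketch assumes the Fortran program reads free-format, whitespace-separated fields (this note does not state the exact numeric format).

```python
def format_record(stratum, event, exit_time, entry_time=None):
    """Format one subject record as whitespace-separated fields.

    entry_time=None gives the three-column layout (basic program);
    a number gives the four-column layout (extended program).
    """
    if event not in (0, 1):
        raise ValueError("event must be 0 (censored) or 1 (event)")
    fields = [str(stratum), str(event)]
    if entry_time is not None:
        if entry_time > exit_time:
            raise ValueError("entry time must not exceed exit time")
        fields.append(f"{entry_time:.4f}")
    fields.append(f"{exit_time:.4f}")
    return " ".join(fields)


# Hypothetical subjects: (stratum, event, entry_time, exit_time)
subjects = [(1, 0, 0.0, 14.0), (4, 1, 0.0, 6.25), (8, 0, 0.0, 14.0)]

# Basic program: entry-time column omitted.
basic_lines = [format_record(s, e, t_out) for s, e, t_in, t_out in subjects]

# Extended program: entry-time column included (0 means no left truncation).
extended_lines = [format_record(s, e, t_out, t_in) for s, e, t_in, t_out in subjects]
```

Writing either list to input.dat, one record per line, gives a file in the layout described above.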
The format of the "control card" file is as follows. The first line should contain
the number of subjects in the data set and the values of "k", "nb", and "nq" as defined
above. These should be in integer format, and there should be a space between each number.
The second line should contain the following numbers (the first three are either 0 or 1,
the last is a nonnegative real number): iopt, iropt, icpr, zval. There should be a space
between each number. Zucker and Spiegelman describe a non-iterative estimator and an
iterative estimator. The value of iopt should be 0 if the non-iterative estimator is
desired and 1 if the iterative estimator is desired. The variable icpr is a 0-1 flag
for whether to print out the full covariance matrix of the estimated regression parameters
(icpr=1) or just the standard errors of the individual coefficients (icpr=0). The variables
iropt and zval relate to the possibility of introducing a certain adjustment in the estimation
calculation. For most working purposes, it is recommended to set iropt=0 and zval=0.0.
There then should follow "k" records for the "k" covariate configurations. Each record
should contain "nb" elements (columns). In the l-th record, the m-th element should be the
value of the m-th covariate in the l-th covariate configuration.
After that, the matrix "A" should be given. This is a "k"-by-"k" matrix of values
("k" records of "k" values each), as described above.
If the matrix "A" is assumed to be known without error, then this completes the
control card file. Otherwise, one continues as follows.
There should be given a series of "k"-by-"k" matrices, "nq" such matrices in all. The l-th
such matrix should give the elementwise derivative of "A" with respect to the parameter omega_l.
Finally, there should be given an "nq"-by-"nq" matrix ("nq" records of "nq" values each)
containing the elements of the covariance matrix for the estimated omega_l's.
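The order of the pieces in the control card file can be sketched in Python as below. The function control_card_text and the toy inputs are hypothetical and not part of the program; the number of decimal places is an arbitrary choice, and the sketch assumes that space-separated values on each record are all the program requires.

```python
def control_card_text(n_subjects, configs, A, dA_list=(), omega_cov=(),
                      iopt=1, iropt=0, icpr=0, zval=0.0):
    """Assemble the control card text in the order described above."""
    k, nb, nq = len(configs), len(configs[0]), len(dA_list)
    if any(len(row) != nb for row in configs):
        raise ValueError("every configuration needs nb covariate values")
    for matrix in [A, *dA_list]:
        if len(matrix) != k or any(len(row) != k for row in matrix):
            raise ValueError("A and its derivatives must be k-by-k")
    if len(omega_cov) != nq or any(len(row) != nq for row in omega_cov):
        raise ValueError("covariance matrix must be nq-by-nq")

    def rows(matrix, fmt):
        return [" ".join(format(v, fmt) for v in row) for row in matrix]

    lines = [f"{n_subjects} {k} {nb} {nq}",        # line 1: sizes
             f"{iopt} {iropt} {icpr} {zval}"]      # line 2: options
    lines += rows(configs, ".1f")                  # k covariate configurations
    lines += rows(A, ".4f")                        # classification matrix A
    for dA in dA_list:                             # nq derivative matrices
        lines += rows(dA, ".1f")
    lines += rows(omega_cov, ".7f")                # covariance of the omega_l's
    return "\n".join(lines) + "\n"


# Toy case: one binary covariate, A treated as known (nq = 0).
text = control_card_text(100, [[0.0], [1.0]], [[0.9, 0.1], [0.2, 0.8]])
```

With "nq" equal to 0, the derivative matrices and the covariance matrix are simply absent, so the file ends after the matrix "A".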
As an illustration, we present the control card file for the example presented in the
paper of Zucker and Spiegelman. The example relates to data from the Nurses' Health Study
regarding the relationship between average daily trans-unsaturated fatty acid (TFA) consumption
(g/day) and cardiovascular disease. The data consist of observations on 80,052 female nurses
who underwent dietary assessment using a food frequency questionnaire (FFQ) in 1980 and were
followed up to June 1, 1994 for cardiovascular events. In the analysis presented in the paper,
TFA was expressed in terms of a binary "high TFA" risk factor variable defined as 1 for subjects
in the 5th TFA quintile and 0 for the others. There was also adjustment for age in 1980 via strata
defined as <45, 45-49, 50-54, 55+. Thus, our model has four regression coefficients: three for
the dummy variables representing the age strata (X_1, X_2, X_3) and one for the binary risk factor
(X_4). There are 4 x 2 = 8 possible covariate configurations.
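As a small illustration, the eight configurations can be enumerated in Python in the same order in which they are listed in the control card file for this example (age stratum varying slowest, X_4 varying fastest). The variable names are hypothetical.

```python
from itertools import product

# Dummy-variable patterns (X_1, X_2, X_3) for the four age strata.
age_patterns = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# Levels of the binary "high TFA" risk factor X_4.
tfa_levels = [0.0, 1.0]

# product() varies its last argument fastest, so X_4 alternates 0, 1
# within each age stratum.
configs = [age + (tfa,) for age, tfa in product(age_patterns, tfa_levels)]

# One control card record per configuration, "nb" = 4 values each.
lines = [" ".join(f"{v:.1f}" for v in row) for row in configs]
```

The stratum number recorded for each subject in input.dat is then the 1-based position of that subject's observed configuration in this list.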
The relevant estimated classification probabilities used were omega_1 = Pr(X_4=0|Z_4=0)=0.8406
and omega_2 = Pr(X_4=1|Z_4=1)=0.3429 (X = true value, Z = observed value). The corresponding
estimated variances were 0.0009711 and 0.0064370. The estimates of omega_1 and omega_2 are assumed
independent. The age stratum indicators are assumed to be measured without misclassification. This
leads to a block diagonal structure for the matrix "A".
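The block diagonal structure can be sketched in Python as follows. The function name build_A_and_derivatives is hypothetical, and the sketch assumes the configurations are ordered with the two X_4 levels adjacent within each age stratum, as in the control card file for this example. Each 2-by-2 block has rows (omega_1, 1 - omega_1) and (1 - omega_2, omega_2), so the derivative matrices contain only the entries 1, -1, and 0.

```python
def build_A_and_derivatives(omega1, omega2, n_age=4):
    """Return A, dA/d(omega_1), dA/d(omega_2) as nested lists.

    Rows index the observed configuration, columns the true configuration.
    """
    k = 2 * n_age
    A = [[0.0] * k for _ in range(k)]
    dA1 = [[0.0] * k for _ in range(k)]
    dA2 = [[0.0] * k for _ in range(k)]
    for s in range(n_age):
        lo, hi = 2 * s, 2 * s + 1  # rows for observed X_4 = 0 and X_4 = 1
        # Observed 0: true 0 with prob. omega_1, true 1 with prob. 1 - omega_1.
        A[lo][lo], A[lo][hi] = omega1, 1.0 - omega1
        # Observed 1: true 0 with prob. 1 - omega_2, true 1 with prob. omega_2.
        A[hi][lo], A[hi][hi] = 1.0 - omega2, omega2
        # Only the observed-0 rows depend on omega_1.
        dA1[lo][lo], dA1[lo][hi] = 1.0, -1.0
        # Only the observed-1 rows depend on omega_2.
        dA2[hi][lo], dA2[hi][hi] = -1.0, 1.0
    return A, dA1, dA2


A, dA1, dA2 = build_A_and_derivatives(0.8406, 0.3429)

# In this construction each row of A is a probability distribution.
row_sums = [sum(row) for row in A]
```

Because the age strata are assumed free of misclassification, every entry outside the four 2-by-2 diagonal blocks is zero.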
The control card file used for this example (with the iterative method) is as follows.
80052 8 4 2
1 0 0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0
1.0 0.0 0.0 1.0
0.0 1.0 0.0 0.0
0.0 1.0 0.0 1.0
0.0 0.0 1.0 0.0
0.0 0.0 1.0 1.0
0.8406 0.1594 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.6571 0.3429 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.8406 0.1594 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.6571 0.3429 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.8406 0.1594 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.6571 0.3429 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.8406 0.1594
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.6571 0.3429
1.0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 -1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0 -1.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 -1.0 1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 -1.0 1.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 -1.0 1.0
0.0009711 0.0000000
0.0000000 0.0064370