PROGRAM FOR COX MODEL ANALYSIS WITH MISCLASSIFIED DISCRETE COVARIATES

Reference: Zucker, D.M. and Spiegelman, D. (2004). Inference for the proportional hazards model with misclassified discrete-valued covariates. Biometrics 60:324-334.

This note describes how to run a Fortran program that implements the method presented in the above-cited reference. This is a method for Cox regression analysis with discrete covariates subject to possible misclassification.

The setup is as follows. There are "nb" discrete covariates, subject to possible misclassification. There are "k" possible configurations of the covariate vector (we include only those configurations that actually appear in the data set). What is observed is not the true covariate vector, but an error-prone surrogate. It is assumed that the surrogate has the same range of levels as the true covariate.

We have a matrix "A" that describes the classification probabilities: A(i,j) is the probability that the true configuration is the j-th one given that the observed configuration is the i-th one. The matrix "A" typically is a function of some number "nq" of parameters, call them omega_l. If the matrix "A" is assumed to be known without error, then "nq" is taken equal to 0.

There are two versions of the program. The basic program consists of two Fortran files: a driver (donrunA.f) and a subroutine file (donsubsA.f). The extended program consists of two similar files, except with "B" instead of "A" in the filenames. The extended program covers an extension to left truncation (which is not discussed in the original paper). The extended program runs somewhat more slowly than the basic program.

To run the program, it is necessary to prepare two files, the main data file and a "control card" file. The main file must be called input.dat. The control card file must be called cntlr.crd. The results are output to donna.out.

The format of the main data file is as follows. There is one record for each subject.
Each record consists of the following elements.

  1. The covariate configuration (stratum) associated with the subject (this will be an integer from 1 to "k").

  2. A 0-1 indicator variable for whether the subject had an event (1) or was censored (0).

  3. The time (e.g. age) at which the subject entered the cohort. This is to accommodate left truncation, which is common in studies where age is the time variable. If there is no left truncation at all, then this variable should be set equal to 0 for all subjects. (If the basic program is being run rather than the extended program, this column should be OMITTED.)

  4. The time at which the subject left the cohort (either due to an event or to censoring).

The format of the "control card" file is as follows.

The first line should contain the number of subjects in the data set and the values of "k", "nb", and "nq" as defined above. These should be in integer format, separated by spaces.

The second line should contain the following numbers, separated by spaces: iopt, iropt, icpr, zval. The first three are either 0 or 1; the last is a nonnegative real number. Zucker and Spiegelman describe a non-iterative estimator and an iterative estimator. The value of iopt should be 0 if the non-iterative estimator is desired and 1 if the iterative estimator is desired. The variable icpr is a 0-1 flag for whether to print out the full covariance matrix of the estimated regression parameters (icpr=1) or just the standard errors of the individual coefficients (icpr=0). The variables iropt and zval relate to the possibility of introducing a certain adjustment in the estimation calculation. For most working purposes, it is recommended to set iropt=0 and zval=0.0.

There then should follow "k" records for the "k" covariate configurations. Each record should contain "nb" elements (columns).
In the l-th record, the m-th element should be the value of the m-th covariate in the l-th covariate configuration.

After that, the matrix "A" should be given. This is a "k"-by-"k" matrix of values ("k" records of "k" values each), as described above. If the matrix "A" is assumed to be known without error, then this completes the control card file. Otherwise, one continues as follows.

There should be given a series of "k"-by-"k" matrices, "nq" such matrices in all. The l-th such matrix should give the elementwise derivative of "A" with respect to the parameter omega_l.

Finally, there should be given a "nq"-by-"nq" matrix ("nq" records of "nq" values each) containing the elements of the covariance matrix for the estimated omega_l's.

As an illustration, we present the control card file for the example presented in the paper of Zucker and Spiegelman. The example relates to data from the Nurses' Health Study regarding the relationship between average daily trans-unsaturated fatty acid (TFA) consumption (g/day) and cardiovascular disease. The data consist of observations on 80,052 female nurses who underwent dietary assessment using a food frequency questionnaire (FFQ) in 1980 and were followed up to June 1, 1994 for cardiovascular events.

In the analysis presented in the paper, TFA was expressed in terms of a binary "high TFA" risk factor variable defined as 1 for subjects in the 5th TFA quintile and 0 for the others. There was also adjustment for age in 1980 via strata defined as <45, 45-49, 50-54, 55+. Thus, our model will have four regression coefficients: three for the dummy variables for the age strata (X_1, X_2, X_3) and one for the binary risk factor (X_4). There are 4 x 2 = 8 possible covariate configurations.

The relevant estimated classification probabilities used were omega_1 = Pr(X_4=0|Z_4=0) = 0.8406 and omega_2 = Pr(X_4=1|Z_4=1) = 0.3429 (X = true value, Z = observed value). The corresponding estimated variances were 0.0009711 and 0.0064370.
The estimates of omega_1 and omega_2 are assumed independent. The age stratum indicators are assumed to be measured without misclassification. This leads to a block diagonal structure for the matrix "A".

The control card file used for this example (with the iterative method) is as follows.

80052 8 4 2
1 0 0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0
1.0 0.0 0.0 1.0
0.0 1.0 0.0 0.0
0.0 1.0 0.0 1.0
0.0 0.0 1.0 0.0
0.0 0.0 1.0 1.0
0.8406 0.1594 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.6571 0.3429 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.8406 0.1594 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.6571 0.3429 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.8406 0.1594 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.6571 0.3429 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.8406 0.1594
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.6571 0.3429
1.0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 -1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0 -1.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 -1.0 1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 -1.0 1.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 -1.0 1.0
0.0009711 0.0000000
0.0000000 0.0064370
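
Because the control card file is long and repetitive, it may be easier to generate it than to type it by hand. The following is a minimal Python sketch, not part of the distributed program, that rebuilds the example file from omega_1, omega_2, and their variances. The helper names (block_diag_repeat, fmt_rows, build_control_card) are hypothetical. The sketch assumes that the program accepts whitespace-separated values with any number of decimal places, and that the order of the covariate configurations is the one shown in the example (age stratum varying slowest, with the first age stratum as the reference level).

```python
def block_diag_repeat(block, n_blocks):
    """Return a block-diagonal matrix with `block` repeated n_blocks times."""
    m = len(block)
    size = m * n_blocks
    out = [[0.0] * size for _ in range(size)]
    for b in range(n_blocks):
        for i in range(m):
            for j in range(m):
                out[b * m + i][b * m + j] = block[i][j]
    return out


def fmt_rows(matrix, digits):
    """Format each matrix row as one space-separated record."""
    return [" ".join(f"{x:.{digits}f}" for x in row) for row in matrix]


def build_control_card(n_subjects, configs, A, dA_list, omega_cov,
                       iopt=1, iropt=0, icpr=0, zval=0.0):
    """Assemble the control card text in the order described in this note."""
    k, nb, nq = len(configs), len(configs[0]), len(dA_list)
    lines = [f"{n_subjects} {k} {nb} {nq}",
             f"{iopt} {iropt} {icpr} {zval:.1f}"]
    lines += fmt_rows(configs, 1)        # k records of nb covariate values
    lines += fmt_rows(A, 4)              # k-by-k classification matrix
    for dA in dA_list:                   # nq derivative matrices, if any
        lines += fmt_rows(dA, 1)
    if nq > 0:
        lines += fmt_rows(omega_cov, 7)  # nq-by-nq covariance of the omegas
    return "\n".join(lines) + "\n"


# Values taken from the example in this note.
omega1, omega2 = 0.8406, 0.3429
n_age = 4  # age strata: <45, 45-49, 50-54, 55+

# Covariate configurations: age dummies (X_1..X_3) crossed with X_4.
configs = []
for age in range(n_age):
    dummies = [1.0 if age == d else 0.0 for d in (1, 2, 3)]
    for x4 in (0.0, 1.0):
        configs.append(dummies + [x4])

# Rows index the observed value of X_4, columns the true value.
A = block_diag_repeat([[omega1, 1 - omega1], [1 - omega2, omega2]], n_age)
dA1 = block_diag_repeat([[1.0, -1.0], [0.0, 0.0]], n_age)   # dA/d(omega_1)
dA2 = block_diag_repeat([[0.0, 0.0], [-1.0, 1.0]], n_age)   # dA/d(omega_2)
cov = [[0.0009711, 0.0], [0.0, 0.0064370]]  # independent estimates

card = build_control_card(80052, configs, A, [dA1, dA2], cov)
print(card, end="")
```

To use the result, write the string `card` to a file named cntlr.crd in the directory where the program is run. Each row of "A" should sum to 1, which gives a quick check on a hand-edited or generated file.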