Re: [R] coxph and ordinal variables?
Hi, everybody On Wed, Sep 8, 2010 at 5:43 PM, Min-Han Tan minhan.scie...@gmail.com wrote: David said my R code text attachment got rejected by the mailing list. Pooh. I don't think that's nice. I don't see anything in the posting guide about a limit on text attachments. Well, if you are still trying to understand what 'orthogonal polynomial' means, I suggest you run the following through. I thought it was an enlightening experience. # Paul Johnson paulj...@ku.edu Nov. 16, 2005 # Ordinal predictors with a small number of possible values # Here is R code and commentary about ordinal predictors. # Please let me know if you have insights to explain this more clearly. set.seed(19) sampleSize - 100 subgroupSize - sampleSize/5 # One of those 5 point opinion questions: Strongly Disagree... # Strongly agree with values assigned 1,3,5,7,9 surveyQuestion - c(rep(1,subgroupSize),rep(3,subgroupSize),rep(5,subgroupSize),rep(7,subgroupSize),rep(9,subgroupSize)) ### Make this an unordered factor myufac - factor(surveyQuestion) ### Study the contrasts: contrasts(myufac) ### Make an ordered factor myofac - ordered(surveyQuestion) contrasts(myofac) # another independent variable x - rnorm(sampleSize) # a random error e - rnorm(sampleSize) # First, suppose the output data is really just a # linear reflection of the surveyQuestion. It is created # by multiplying against the evenly spaced values # observed in the survey y1 - -5 + 4*surveyQuestion + 6*x + 10 * e mod0 - lm(y1~x + surveyQuestion) summary(mod0) # Question: are you making a mistake by just throwing # surveyQuestion into the analysis? You should not be # making a mistake, because you (luckily) guessed the right model # You might check by running the model with the unordered factor, # which amounts to running dummy variables mod1 - lm(y1~x + myufac) summary(mod1) # or with the ordered factor, which estimates the orthogonal # polynomials mod2 - lm(y1~x + myofac) summary(mod2) # Does either of those model appear to be better than # the original mod0? # If I did not know for sure the factor was ordered, I'd # be stuck with the treatment contrasts in mod1. And what # I would really like to know is this: are the coefficients # in mod1 stepping up in a clear, ordinal, evenly spaced pattern? # Since we know the data is created with a coefficient of 4 # we would expect that the coefficients would step up like this # myufac3 8 # myufac5 16 # myufac7 24 # myufac9 32 # See why we expect this pattern? The intercept gobbles up myufac1, # so each impact of the surveyQuestion has to be reduced by 4 units. # The impact of myufac3, which you expect might be 3*4=12, is reduced # to 3*4 - 4 = 8, and so forth. # But that does not happen with a smallish sample. You can run this # code a few times. It appears to me that a sample of more than # 10,000 is necessary to get any faithful reproduction of the steps # between values of surveyQuestion. # Question: would we mistakenly think that the uneveness observed in # summary(mod1) is evidence of the need to treat surveyQuestion as a # nominal factor, even though we know (since we created the data) that # we might as well just throw surveyQuestion in as a single variable? # How to decide? # We need a hypothesis test of the conjecture that # the coefficient estimates in mod1 fall along a line # i.e, myufac-i = (b2 * i ) - b2 # I believe that the correct test results from this command: anova(mod0, mod1, test=Chisq) # If that is significant, it means you are losing predictive # power by throwing in surveyQuestion as if it were a numerical # variable. # Now, what if we are sure that the data gathered in surveyQuestion is # really ordinal, and so we estimate mod2. # What do those parameters mean? Here I'm just reasoning/guessing # because I cannot find any complete idiot's guide to orthogonal # polynomials and their use in regression and/or R. # Take a look at the contrasts themselves # ord1 - contrasts(myofac) # ord1 # .L .Q.C ^4 # 1 -6.324555e-01 0.5345225 -3.162278e-01 0.1195229 # 3 -3.162278e-01 -0.2672612 6.324555e-01 -0.4780914 # 5 -3.287978e-17 -0.5345225 1.595204e-16 0.7171372 # 7 3.162278e-01 -0.2672612 -6.324555e-01 -0.4780914 # 9 6.324555e-01 0.5345225 3.162278e-01 0.1195229 # What does this mean? I believe: we act as though there are 4 # independent variables in the regression, L, Q, C, and ^4, and for # each respondent in the dataset, we choose a row of these numbers # to act as independent variables. A person who # answered 1 on the survey would have the input values (-.63, -.53, # -.31, 0.11) for those four variables. # This begs the question, what are L, Q, C, and ^4? ### This is tough to explain. # Background Recall from calculus that any function can be # approximated by a polynomial. Since surveyQuestion has only 5 # possible values, a polynomial of degree 4 is needed to replace it # in a regression model. It can
Re: [R] coxph and ordinal variables?
On Wed, 8 Sep 2010, Paul Johnson wrote: run it with factor() instead of ordered(). You don't want the orthogonal polynomial contrasts that result from ordered if you need to compare against Stata. If you don't want polynomial contrasts for ordered factors, you can just tell R not to use them. options(contrasts=c(contr.treatment,contr.treatment)) It's like the Good Old Days when you had to use options() to tell S-PLUS not to use Helmert contrasts. -thomas Thomas Lumley Professor of Biostatistics University of Washington, Seattle __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] coxph and ordinal variables?
Dear R-help members, Apologies - I am posting on behalf of a colleague, who is a little puzzled as STATA and R seem to be yielding different survival estimates for the same dataset when treating a variable as ordinal. Ordered() is used to represent an ordinal variable) I understand that R's coxph (by default) uses the Efron approximation, whereas STATA uses (by default) the Breslow. but we did compare using the same approximations. I am wondering if this is a result of how coxph manages an ordered factor? Essentially, this is a survival dataset using tumor grade (1, 2, 3 and 4) as the risk factor. This is more of an 'ordinal' variable, rather than a continuous variable. For the same data set of 399 patients, when treating the vector of tumor grade as a continuous variable (range of 1 to 4), testing the Efron and the Breslow approximations yield the same result in both R and STATA. However, when Hist_Grade_4 grp is converted into an ordered factor using ordered(), and the same scripts are applied, rather different results are obtained, relative to the STATA output. This is tested across the different approximations, with consistent results. The comparison using Efron approximation and ordinal data is is below. Your advice is very much appreciated! Min-Han Apologies below for the slightly malaligned output. STATA output . xi:stcox i.Hist_Grade_4grp, efr i.Hist_Grade_~p _IHist_Grad_1-4 (naturally coded; _IHist_Grad_1 omitted) failure _d: FFR_censor analysis time _t: FFR_month Iteration 0: log likelihood = -1133.369 Iteration 1: log likelihood = -1129.4686 Iteration 2: log likelihood = -1129.3196 Iteration 3: log likelihood = -1129.3191 Refining estimates: Iteration 0: log likelihood = -1129.3191 Cox regression -- Efron method for ties No. of subjects = 399 Number of obs = 399 No. of failures = 218 Time at risk= 9004.484606 LR chi2(3) = 8.10 Log likelihood = -1129.3191 Prob chi2 = 0.0440 -- _t | Haz. Ratio Std. Err. zP|z| [95% Conf. Interval] -+ _IHist_Gra~2 | 1.408166 .3166876 1.52 0.128 .9062001 2.188183 _IHist_Gra~3 |1.69506 .3886792 2.30 0.021 1.081443 2.656847 _IHist_Gra~4 | 2.540278 .9997843 2.37 0.018 1.17455 5.49403 R Output using summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(breslow))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(exact))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(efron))) n=399 (21 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(|z|) Hist_Grade_4grp.L 0.66685 1.94809 0.26644 2.503 0.0123 * Hist_Grade_4grp.Q 0.03113 1.03162 0.20842 0.149 0.8813 Hist_Grade_4grp.C 0.08407 1.08771 0.13233 0.635 0.5252 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 exp(coef) exp(-coef) lower .95 upper .95 Hist_Grade_4grp.L 1.948 0.51331.1556 3.284 Hist_Grade_4grp.Q 1.032 0.96930.6857 1.552 Hist_Grade_4grp.C 1.088 0.91940.8392 1.410 Rsquare= 0.02 (max possible= 0.997 ) Likelihood ratio test= 8.1 on 3 df, p=0.044 Wald test= 8.02 on 3 df, p=0.0455 Score (logrank) test = 8.2 on 3 df, p=0.04202 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxph and ordinal variables?
On Sep 8, 2010, at 6:43 PM, Min-Han Tan wrote: Dear R-help members, Apologies - I am posting on behalf of a colleague, who is a little puzzled as STATA and R seem to be yielding different survival estimates for the same dataset when treating a variable as ordinal. Ordered() is used to represent an ordinal variable) I understand that R's coxph (by default) uses the Efron approximation, whereas STATA uses (by default) the Breslow. but we did compare using the same approximations. I am wondering if this is a result of how coxph manages an ordered factor? Essentially, this is a survival dataset using tumor grade (1, 2, 3 and 4) as the risk factor. This is more of an 'ordinal' variable, rather than a continuous variable. For the same data set of 399 patients, when treating the vector of tumor grade as a continuous variable (range of 1 to 4), testing the Efron and the Breslow approximations yield the same result in both R and STATA. However, when Hist_Grade_4 grp is converted into an ordered factor using ordered(), and the same scripts are applied, rather different results are obtained, relative to the STATA output. This is tested across the different approximations, with consistent results. The comparison using Efron approximation and ordinal data is is below. Are you sure you want an ordered factor? In R this means you will be creating linear, quadratic and cubic contrasts. Notice the L, Q and C designations on the coefficients. That certainly does not look to be comparable to what you are getting from Stata. My suggestion would be to create an un-ordered factor in R and see whether you get results more in line with Stata's output when applied to your data. -- David. Your advice is very much appreciated! Min-Han Apologies below for the slightly malaligned output. STATA output . xi:stcox i.Hist_Grade_4grp, efr i.Hist_Grade_~p _IHist_Grad_1-4 (naturally coded; _IHist_Grad_1 omitted) failure _d: FFR_censor analysis time _t: FFR_month Iteration 0: log likelihood = -1133.369 Iteration 1: log likelihood = -1129.4686 Iteration 2: log likelihood = -1129.3196 Iteration 3: log likelihood = -1129.3191 Refining estimates: Iteration 0: log likelihood = -1129.3191 Cox regression -- Efron method for ties No. of subjects = 399 Number of obs = 399 No. of failures = 218 Time at risk= 9004.484606 LR chi2(3) = 8.10 Log likelihood = -1129.3191 Prob chi2 = 0.0440 -- _t | Haz. Ratio Std. Err. zP|z| [95% Conf. Interval] - + _IHist_Gra~2 | 1.408166 .3166876 1.52 0.128 .9062001 2.188183 _IHist_Gra~3 |1.69506 .3886792 2.30 0.021 1.081443 2.656847 _IHist_Gra~4 | 2.540278 .9997843 2.37 0.018 1.17455 5.49403 R Output using summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(breslow))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(exact))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(efron))) n=399 (21 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(|z|) Hist_Grade_4grp.L 0.66685 1.94809 0.26644 2.503 0.0123 * Hist_Grade_4grp.Q 0.03113 1.03162 0.20842 0.149 0.8813 Hist_Grade_4grp.C 0.08407 1.08771 0.13233 0.635 0.5252 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 exp(coef) exp(-coef) lower .95 upper .95 Hist_Grade_4grp.L 1.948 0.51331.1556 3.284 Hist_Grade_4grp.Q 1.032 0.96930.6857 1.552 Hist_Grade_4grp.C 1.088 0.91940.8392 1.410 Rsquare= 0.02 (max possible= 0.997 ) Likelihood ratio test= 8.1 on 3 df, p=0.044 Wald test= 8.02 on 3 df, p=0.0455 Score (logrank) test = 8.2 on 3 df, p=0.04202 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD West Hartford, CT __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxph and ordinal variables?
run it with factor() instead of ordered(). You don't want the orthogonal polynomial contrasts that result from ordered if you need to compare against Stata. I attach an R program that I wrote to explore ordered factors a while agol I believe this will clear everything up if you study the examples. pj On Wed, Sep 8, 2010 at 5:43 PM, Min-Han Tan minhan.scie...@gmail.com wrote: Dear R-help members, Apologies - I am posting on behalf of a colleague, who is a little puzzled as STATA and R seem to be yielding different survival estimates for the same dataset when treating a variable as ordinal. Ordered() is used to represent an ordinal variable) I understand that R's coxph (by default) uses the Efron approximation, whereas STATA uses (by default) the Breslow. but we did compare using the same approximations. I am wondering if this is a result of how coxph manages an ordered factor? Essentially, this is a survival dataset using tumor grade (1, 2, 3 and 4) as the risk factor. This is more of an 'ordinal' variable, rather than a continuous variable. For the same data set of 399 patients, when treating the vector of tumor grade as a continuous variable (range of 1 to 4), testing the Efron and the Breslow approximations yield the same result in both R and STATA. However, when Hist_Grade_4 grp is converted into an ordered factor using ordered(), and the same scripts are applied, rather different results are obtained, relative to the STATA output. This is tested across the different approximations, with consistent results. The comparison using Efron approximation and ordinal data is is below. Your advice is very much appreciated! Min-Han Apologies below for the slightly malaligned output. STATA output . xi:stcox i.Hist_Grade_4grp, efr i.Hist_Grade_~p _IHist_Grad_1-4 (naturally coded; _IHist_Grad_1 omitted) failure _d: FFR_censor analysis time _t: FFR_month Iteration 0: log likelihood = -1133.369 Iteration 1: log likelihood = -1129.4686 Iteration 2: log likelihood = -1129.3196 Iteration 3: log likelihood = -1129.3191 Refining estimates: Iteration 0: log likelihood = -1129.3191 Cox regression -- Efron method for ties No. of subjects = 399 Number of obs = 399 No. of failures = 218 Time at risk = 9004.484606 LR chi2(3) = 8.10 Log likelihood = -1129.3191 Prob chi2 = 0.0440 -- _t | Haz. Ratio Std. Err. z P|z| [95% Conf. Interval] -+ _IHist_Gra~2 | 1.408166 .3166876 1.52 0.128 .9062001 2.188183 _IHist_Gra~3 | 1.69506 .3886792 2.30 0.021 1.081443 2.656847 _IHist_Gra~4 | 2.540278 .9997843 2.37 0.018 1.17455 5.49403 R Output using summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(breslow))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(exact))) summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp, method=c(efron))) n=399 (21 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(|z|) Hist_Grade_4grp.L 0.66685 1.94809 0.26644 2.503 0.0123 * Hist_Grade_4grp.Q 0.03113 1.03162 0.20842 0.149 0.8813 Hist_Grade_4grp.C 0.08407 1.08771 0.13233 0.635 0.5252 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 exp(coef) exp(-coef) lower .95 upper .95 Hist_Grade_4grp.L 1.948 0.5133 1.1556 3.284 Hist_Grade_4grp.Q 1.032 0.9693 0.6857 1.552 Hist_Grade_4grp.C 1.088 0.9194 0.8392 1.410 Rsquare= 0.02 (max possible= 0.997 ) Likelihood ratio test= 8.1 on 3 df, p=0.044 Wald test = 8.02 on 3 df, p=0.0455 Score (logrank) test = 8.2 on 3 df, p=0.04202 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxph and ordinal variables?
I look at this question in a different angle. My understanding is: 1. If treat tumor_grade as a numerical variable, you assume the hazard ratio is invariant between any two adjacent levels of the tumor grade (assuming invariant covariate patterns of other risks); 2. If you treat the tumor_grade as factor, you don't assume the constant hazard ratio between the levels (assuming invariant covariate patterns of other risks). I don't see the difference between continuous risk and discrete ordinal risk in this case. If the tumor grade is the response, there will be a difference between variables types (ordinal, continuous and nominal) since that information will be used to select appropriate models. -- View this message in context: http://r.789695.n4.nabble.com/coxph-and-ordinal-variables-tp2532149p2532299.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.