Re: [R] coxph and ordinal variables?

2010-09-10 Thread Paul Johnson
Hi, everybody

On Wed, Sep 8, 2010 at 5:43 PM, Min-Han Tan minhan.scie...@gmail.com wrote:


David said my R code text attachment got rejected by the mailing list.

Pooh.   I don't think that's nice.  I don't see anything in the
posting guide about a limit on text attachments.

Well, if you are still trying to understand what 'orthogonal
polynomial' means, I suggest you run the following through.  I thought
it was an
enlightening experience.

# Paul Johnson paulj...@ku.edu Nov. 16, 2005

# Ordinal predictors with a small number of possible values

# Here is R code and commentary about ordinal predictors.
# Please let me know if you have insights to explain this more clearly.

set.seed(19)

sampleSize - 100
subgroupSize - sampleSize/5
# One of those 5 point opinion questions: Strongly Disagree...
# Strongly agree with values assigned 1,3,5,7,9
surveyQuestion -
c(rep(1,subgroupSize),rep(3,subgroupSize),rep(5,subgroupSize),rep(7,subgroupSize),rep(9,subgroupSize))

### Make this an unordered factor
myufac - factor(surveyQuestion)
### Study the contrasts:
contrasts(myufac)

### Make an ordered  factor
myofac - ordered(surveyQuestion)
contrasts(myofac)

# another independent variable
x - rnorm(sampleSize)
# a random error
e - rnorm(sampleSize)


# First, suppose the output data is really just a
# linear reflection of the surveyQuestion. It is created
# by multiplying against the evenly spaced values
# observed in the survey

y1 - -5 +  4*surveyQuestion + 6*x + 10 * e


mod0 - lm(y1~x + surveyQuestion)
summary(mod0)

# Question: are you making a mistake by just throwing
# surveyQuestion into the analysis? You should not be
# making a mistake, because you (luckily) guessed the right model

# You might check by running the model with the unordered factor,
# which amounts to running dummy variables

mod1 - lm(y1~x + myufac)
summary(mod1)

# or with the ordered factor, which estimates the orthogonal
# polynomials

mod2 - lm(y1~x + myofac)
summary(mod2)

# Does either of those model appear to be better than
# the original mod0?

# If I did not know for sure the factor was ordered, I'd
# be stuck with the treatment contrasts in mod1.  And what
# I would really like to know is this: are the coefficients
# in mod1 stepping up in a clear, ordinal, evenly spaced pattern?

# Since we know the data is created with a coefficient of 4
# we would expect that the coefficients would step up like this
# myufac3   8
# myufac5   16
# myufac7   24
# myufac9   32

# See why we expect this pattern? The intercept gobbles up myufac1,
# so each impact of the surveyQuestion has to be reduced by 4 units.
# The impact of myufac3, which you expect might be 3*4=12, is reduced
# to 3*4 - 4 = 8, and so forth.

# But that does not happen with a smallish sample. You can run this
# code a few times. It appears to me that a sample of more than
# 10,000 is necessary to get any faithful reproduction of the steps
# between values of surveyQuestion.

# Question: would we mistakenly think that the uneveness observed in
# summary(mod1) is evidence of the need to treat surveyQuestion as a
# nominal factor, even though we know (since we created the data) that
# we might as well just throw surveyQuestion in as a single variable?
# How to decide?

# We need a hypothesis test of the conjecture that
# the coefficient estimates in mod1 fall along a line
# i.e, myufac-i = (b2 * i ) - b2

# I believe that the correct test results from this command:

anova(mod0, mod1, test=Chisq)

# If that is significant, it means you are losing predictive
# power by throwing in surveyQuestion as if it were a numerical
# variable.



# Now, what if we are sure that the data gathered in surveyQuestion is
# really ordinal, and so we estimate mod2.

# What do those parameters mean? Here I'm just reasoning/guessing
# because I cannot find any complete idiot's guide to orthogonal
# polynomials and their use in regression and/or R.

# Take a look at the contrasts themselves
#  ord1 - contrasts(myofac)
# ord1
#   .L .Q.C ^4
# 1 -6.324555e-01  0.5345225 -3.162278e-01  0.1195229
# 3 -3.162278e-01 -0.2672612  6.324555e-01 -0.4780914
# 5 -3.287978e-17 -0.5345225  1.595204e-16  0.7171372
# 7  3.162278e-01 -0.2672612 -6.324555e-01 -0.4780914
# 9  6.324555e-01  0.5345225  3.162278e-01  0.1195229

# What does this mean?  I believe: we act as though there are 4
# independent variables in the regression, L, Q, C, and ^4, and for
# each respondent in the dataset, we choose a row of these numbers
# to act as independent variables.  A person who
# answered 1 on the survey would have the input values (-.63, -.53,
# -.31, 0.11) for those four variables.

# This begs the question, what are L, Q, C, and ^4?

### This is tough to explain.

# Background Recall from calculus that any function can be
# approximated by a polynomial.  Since surveyQuestion has only 5
# possible values, a polynomial of degree 4 is needed to replace it
# in a regression model. It can 

Re: [R] coxph and ordinal variables?

2010-09-09 Thread Thomas Lumley

On Wed, 8 Sep 2010, Paul Johnson wrote:


run it with factor() instead of ordered().  You don't want the
orthogonal polynomial contrasts that result from ordered if you need
to compare against Stata.


If you don't want polynomial contrasts for ordered factors, you can just tell R 
not to use them.

options(contrasts=c(contr.treatment,contr.treatment))

It's like the Good Old Days when you had to use options() to tell S-PLUS not to 
use Helmert contrasts.


   -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] coxph and ordinal variables?

2010-09-08 Thread Min-Han Tan
Dear R-help members,

Apologies - I am posting on behalf of a colleague, who is a little puzzled
as STATA and R seem to be yielding different survival estimates for the same
dataset when treating a variable as ordinal. Ordered() is used  to represent
an ordinal variable) I understand that R's coxph (by default) uses the Efron
approximation, whereas STATA uses (by default) the Breslow. but we did
compare using the same approximations. I am wondering if this is a result of
how coxph manages an ordered factor?

Essentially, this is a survival dataset using tumor grade (1, 2, 3 and 4) as
the risk factor. This is more of an 'ordinal' variable, rather than a
continuous variable. For the same data set of 399 patients, when treating
the vector of tumor grade as a continuous variable (range of 1 to 4),
testing the Efron and the Breslow approximations yield the same result in
both R and STATA.

However, when Hist_Grade_4 grp is converted into an ordered factor using
ordered(), and the same scripts are applied, rather different results are
obtained, relative to the STATA output. This is tested across the different
approximations, with consistent results. The comparison using Efron
approximation and ordinal data is is below.

Your advice is very much appreciated!

Min-Han

Apologies below for the slightly malaligned output.

STATA output

. xi:stcox i.Hist_Grade_4grp, efr
i.Hist_Grade_~p   _IHist_Grad_1-4 (naturally coded; _IHist_Grad_1
omitted)

failure _d:  FFR_censor
  analysis time _t:  FFR_month

Iteration 0:   log likelihood =  -1133.369
Iteration 1:   log likelihood = -1129.4686
Iteration 2:   log likelihood = -1129.3196
Iteration 3:   log likelihood = -1129.3191
Refining estimates:
Iteration 0:   log likelihood = -1129.3191

Cox regression -- Efron method for ties

No. of subjects =  399 Number of obs   =
399
No. of failures =  218
Time at risk=  9004.484606
  LR chi2(3)  =
 8.10
Log likelihood  =   -1129.3191 Prob  chi2 =
 0.0440

--
 _t | Haz. Ratio   Std. Err.  zP|z| [95% Conf.
Interval]
-+
_IHist_Gra~2 |   1.408166   .3166876 1.52   0.128 .9062001
 2.188183
_IHist_Gra~3 |1.69506   .3886792 2.30   0.021 1.081443
 2.656847
_IHist_Gra~4 |   2.540278   .9997843 2.37   0.018  1.17455
5.49403



R Output using
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
method=c(breslow)))
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
method=c(exact)))
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
method=c(efron)))



 n=399 (21 observations deleted due to missingness)

coef exp(coef) se(coef) z Pr(|z|)
Hist_Grade_4grp.L 0.66685   1.94809  0.26644 2.503   0.0123 *
Hist_Grade_4grp.Q 0.03113   1.03162  0.20842 0.149   0.8813
Hist_Grade_4grp.C 0.08407   1.08771  0.13233 0.635   0.5252
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 exp(coef) exp(-coef) lower .95 upper .95
Hist_Grade_4grp.L 1.948 0.51331.1556 3.284
Hist_Grade_4grp.Q 1.032 0.96930.6857 1.552
Hist_Grade_4grp.C 1.088 0.91940.8392 1.410

Rsquare= 0.02   (max possible= 0.997 )
Likelihood ratio test= 8.1  on 3 df,   p=0.044
Wald test= 8.02  on 3 df,   p=0.0455
Score (logrank) test = 8.2  on 3 df,   p=0.04202

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxph and ordinal variables?

2010-09-08 Thread David Winsemius


On Sep 8, 2010, at 6:43 PM, Min-Han Tan wrote:


Dear R-help members,

Apologies - I am posting on behalf of a colleague, who is a little  
puzzled
as STATA and R seem to be yielding different survival estimates for  
the same
dataset when treating a variable as ordinal. Ordered() is used  to  
represent
an ordinal variable) I understand that R's coxph (by default) uses  
the Efron

approximation, whereas STATA uses (by default) the Breslow. but we did
compare using the same approximations. I am wondering if this is a  
result of

how coxph manages an ordered factor?

Essentially, this is a survival dataset using tumor grade (1, 2, 3  
and 4) as

the risk factor. This is more of an 'ordinal' variable, rather than a
continuous variable. For the same data set of 399 patients, when  
treating

the vector of tumor grade as a continuous variable (range of 1 to 4),
testing the Efron and the Breslow approximations yield the same  
result in

both R and STATA.

However, when Hist_Grade_4 grp is converted into an ordered factor  
using
ordered(), and the same scripts are applied, rather different  
results are
obtained, relative to the STATA output. This is tested across the  
different

approximations, with consistent results. The comparison using Efron
approximation and ordinal data is is below.


Are you sure you want an ordered factor? In R this means you will be  
creating linear, quadratic and cubic contrasts. Notice the L, Q and C  
designations on the coefficients. That certainly does not look to be  
comparable to what you are getting from Stata. My suggestion would be  
to create an un-ordered factor in R and see whether you get results  
more in line with Stata's output when applied to your data.


--
David.


Your advice is very much appreciated!

Min-Han

Apologies below for the slightly malaligned output.

STATA output

. xi:stcox i.Hist_Grade_4grp, efr
i.Hist_Grade_~p   _IHist_Grad_1-4 (naturally coded; _IHist_Grad_1
omitted)

   failure _d:  FFR_censor
 analysis time _t:  FFR_month

Iteration 0:   log likelihood =  -1133.369
Iteration 1:   log likelihood = -1129.4686
Iteration 2:   log likelihood = -1129.3196
Iteration 3:   log likelihood = -1129.3191
Refining estimates:
Iteration 0:   log likelihood = -1129.3191

Cox regression -- Efron method for ties

No. of subjects =  399 Number of obs   =
399
No. of failures =  218
Time at risk=  9004.484606
 LR chi2(3)  =
8.10
Log likelihood  =   -1129.3191 Prob  chi2 =
0.0440

--
_t | Haz. Ratio   Std. Err.  zP|z| [95% Conf.
Interval]
- 
+

_IHist_Gra~2 |   1.408166   .3166876 1.52   0.128 .9062001
2.188183
_IHist_Gra~3 |1.69506   .3886792 2.30   0.021 1.081443
2.656847
_IHist_Gra~4 |   2.540278   .9997843 2.37   0.018  1.17455
5.49403



R Output using

summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,

method=c(breslow)))

summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,

method=c(exact)))

summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,

method=c(efron)))



n=399 (21 observations deleted due to missingness)

   coef exp(coef) se(coef) z Pr(|z|)
Hist_Grade_4grp.L 0.66685   1.94809  0.26644 2.503   0.0123 *
Hist_Grade_4grp.Q 0.03113   1.03162  0.20842 0.149   0.8813
Hist_Grade_4grp.C 0.08407   1.08771  0.13233 0.635   0.5252
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
Hist_Grade_4grp.L 1.948 0.51331.1556 3.284
Hist_Grade_4grp.Q 1.032 0.96930.6857 1.552
Hist_Grade_4grp.C 1.088 0.91940.8392 1.410

Rsquare= 0.02   (max possible= 0.997 )
Likelihood ratio test= 8.1  on 3 df,   p=0.044
Wald test= 8.02  on 3 df,   p=0.0455
Score (logrank) test = 8.2  on 3 df,   p=0.04202

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxph and ordinal variables?

2010-09-08 Thread Paul Johnson
run it with factor() instead of ordered().  You don't want the
orthogonal polynomial contrasts that result from ordered if you need
to compare against Stata.

I attach an R program that I wrote to explore ordered factors a while
agol I believe this will clear everything up if you study the
examples.

pj

On Wed, Sep 8, 2010 at 5:43 PM, Min-Han Tan minhan.scie...@gmail.com wrote:
 Dear R-help members,

 Apologies - I am posting on behalf of a colleague, who is a little puzzled
 as STATA and R seem to be yielding different survival estimates for the same
 dataset when treating a variable as ordinal. Ordered() is used  to represent
 an ordinal variable) I understand that R's coxph (by default) uses the Efron
 approximation, whereas STATA uses (by default) the Breslow. but we did
 compare using the same approximations. I am wondering if this is a result of
 how coxph manages an ordered factor?

 Essentially, this is a survival dataset using tumor grade (1, 2, 3 and 4) as
 the risk factor. This is more of an 'ordinal' variable, rather than a
 continuous variable. For the same data set of 399 patients, when treating
 the vector of tumor grade as a continuous variable (range of 1 to 4),
 testing the Efron and the Breslow approximations yield the same result in
 both R and STATA.

 However, when Hist_Grade_4 grp is converted into an ordered factor using
 ordered(), and the same scripts are applied, rather different results are
 obtained, relative to the STATA output. This is tested across the different
 approximations, with consistent results. The comparison using Efron
 approximation and ordinal data is is below.

 Your advice is very much appreciated!

 Min-Han

 Apologies below for the slightly malaligned output.

 STATA output

 . xi:stcox i.Hist_Grade_4grp, efr
 i.Hist_Grade_~p   _IHist_Grad_1-4     (naturally coded; _IHist_Grad_1
 omitted)

        failure _d:  FFR_censor
  analysis time _t:  FFR_month

 Iteration 0:   log likelihood =  -1133.369
 Iteration 1:   log likelihood = -1129.4686
 Iteration 2:   log likelihood = -1129.3196
 Iteration 3:   log likelihood = -1129.3191
 Refining estimates:
 Iteration 0:   log likelihood = -1129.3191

 Cox regression -- Efron method for ties

 No. of subjects =          399                     Number of obs   =
 399
 No. of failures =          218
 Time at risk    =  9004.484606
                                                  LR chi2(3)      =
  8.10
 Log likelihood  =   -1129.3191                     Prob  chi2     =
  0.0440

 --
         _t | Haz. Ratio   Std. Err.      z    P|z|     [95% Conf.
 Interval]
 -+
 _IHist_Gra~2 |   1.408166   .3166876     1.52   0.128     .9062001
  2.188183
 _IHist_Gra~3 |    1.69506   .3886792     2.30   0.021     1.081443
  2.656847
 _IHist_Gra~4 |   2.540278   .9997843     2.37   0.018      1.17455
 5.49403



 R Output using
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
 method=c(breslow)))
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
 method=c(exact)))
 summary ( coxph( Surv(FFR_month,FFR_censor) ~ Hist_Grade_4grp,
 method=c(efron)))



  n=399 (21 observations deleted due to missingness)

                    coef exp(coef) se(coef)     z Pr(|z|)
 Hist_Grade_4grp.L 0.66685   1.94809  0.26644 2.503   0.0123 *
 Hist_Grade_4grp.Q 0.03113   1.03162  0.20842 0.149   0.8813
 Hist_Grade_4grp.C 0.08407   1.08771  0.13233 0.635   0.5252
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                 exp(coef) exp(-coef) lower .95 upper .95
 Hist_Grade_4grp.L     1.948     0.5133    1.1556     3.284
 Hist_Grade_4grp.Q     1.032     0.9693    0.6857     1.552
 Hist_Grade_4grp.C     1.088     0.9194    0.8392     1.410

 Rsquare= 0.02   (max possible= 0.997 )
 Likelihood ratio test= 8.1  on 3 df,   p=0.044
 Wald test            = 8.02  on 3 df,   p=0.0455
 Score (logrank) test = 8.2  on 3 df,   p=0.04202

        [[alternative HTML version deleted]]


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.





-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxph and ordinal variables?

2010-09-08 Thread Peng, C

I look at this question in a different angle. My understanding is:

1. If treat tumor_grade as a numerical variable, you assume the hazard ratio
is invariant between any two adjacent levels of the tumor grade (assuming
invariant covariate patterns of other risks);
2. If you treat the tumor_grade as factor, you don't assume the constant
hazard ratio between the levels (assuming invariant covariate patterns of
other risks).

I don't see the difference between continuous risk and discrete ordinal risk
in this case. 
If the tumor grade is the response, there will be a difference between
variables types (ordinal, continuous and nominal) since that information
will be used to select appropriate models.

-- 
View this message in context: 
http://r.789695.n4.nabble.com/coxph-and-ordinal-variables-tp2532149p2532299.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.