[R] covariate selection?

2004-10-12 Thread Ian Fiske
Hello,
I am hoping someone can help me with the following multivariate issue:  
I have a model consisting of about 50 covariates.  I would like to 
reduce this to about 5 covariate for the reduced model by combining 
cofactors that are strongly correlated.  Is there a package or function 
that would help me with this in R?  I appreciate any suggestions.

Thanks,
Ian
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] covariate selection?

2004-10-12 Thread Martínez Ovando Juan Carlos
Hello Ian,

?princomp

If your covariates are scalars, and the following documents:

http://www.jstatsoft.org/v07/i01/drdoc.pdf

http://www.bioconductor.org/workshops/Milan/PDF/Lab12.pdf


Best wishes. 

Saludos,
 
Juan Carlos Martínez Ovando
Banco de México
Av. 5 de Mayo No. 18
Piso 5 Sección D
Col. Centro
06059  México, D. F.
Tel. +52 55 52.37.20.00 ext. 3594
Fax. +52 55 52.37.27.03
e-mail: [EMAIL PROTECTED]
 

-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 12 de Octubre de 2004 04:08 PM
Para: [EMAIL PROTECTED]
Asunto: [R] covariate selection?

Hello,

I am hoping someone can help me with the following multivariate issue:  
I have a model consisting of about 50 covariates.  I would like to 
reduce this to about 5 covariate for the reduced model by combining 
cofactors that are strongly correlated.  Is there a package or function 
that would help me with this in R?  I appreciate any suggestions.

Thanks,
Ian

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] covariate selection?

2004-10-12 Thread Ian Fiske
Thanks Juan.  I thought that was what I was looking for, but really, I 
want to know which of the original covariates could best be used to take 
advantage of their colinearity without creating new variables.  I think 
PCA creates new variables.  SAS and SPSS can do what I'm talking about, 
but I would like to use R for this.

Thanks,
Ian

Martínez Ovando Juan Carlos wrote:
Hello Ian,
?princomp
If your covariates are scalars, and the following documents:
http://www.jstatsoft.org/v07/i01/drdoc.pdf
http://www.bioconductor.org/workshops/Milan/PDF/Lab12.pdf
Best wishes. 

Saludos,
Juan Carlos Martínez Ovando
Banco de México
Av. 5 de Mayo No. 18
Piso 5 Sección D
Col. Centro
06059  México, D. F.
Tel. +52 55 52.37.20.00 ext. 3594
Fax. +52 55 52.37.27.03
e-mail: [EMAIL PROTECTED]
-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 12 de Octubre de 2004 04:08 PM
Para: [EMAIL PROTECTED]
Asunto: [R] covariate selection?

Hello,
I am hoping someone can help me with the following multivariate issue:  
I have a model consisting of about 50 covariates.  I would like to 
reduce this to about 5 covariate for the reduced model by combining 
cofactors that are strongly correlated.  Is there a package or function 
that would help me with this in R?  I appreciate any suggestions.

Thanks,
Ian
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
 

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] covariate selection?

2004-10-12 Thread Spencer Graves
 Have you considered stepwise regression, e.g., step or stepAIC 
in library(MASS)?  The documentation for both contain examples. 

 hope this helps. 
 spencer graves

Ian Fiske wrote:
Hello,
I am hoping someone can help me with the following multivariate 
issue:  I have a model consisting of about 50 covariates.  I would 
like to reduce this to about 5 covariate for the reduced model by 
combining cofactors that are strongly correlated.  Is there a package 
or function that would help me with this in R?  I appreciate any 
suggestions.

Thanks,
Ian
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html

--
Spencer Graves, PhD, Senior Development Engineer
O:  (408)938-4420;  mobile:  (408)655-4567
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] covariate selection?

2004-10-12 Thread Martínez Ovando Juan Carlos
Hello Ian,

Sorry. I don't really understand your problem, which is of model selection. That's 
right? 

You could use some criteria based in likelihood. For instante Akaike (AIC) or Schwarz 
criteria (BIC), see: 

?AIC

?mle.aic

(The best model is determined minimizing AIC or BIC).

I hope this help you.

Greetings,
 
Juan Carlos 

-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 12 de Octubre de 2004 05:17 PM
Para: Martínez Ovando Juan Carlos
CC: [EMAIL PROTECTED]
Asunto: Re: [R] covariate selection?

Thanks Juan.  I thought that was what I was looking for, but really, I 
want to know which of the original covariates could best be used to take 
advantage of their colinearity without creating new variables.  I think 
PCA creates new variables.  SAS and SPSS can do what I'm talking about, 
but I would like to use R for this.

Thanks,
Ian



Martínez Ovando Juan Carlos wrote:

Hello Ian,

?princomp

If your covariates are scalars, and the following documents:

http://www.jstatsoft.org/v07/i01/drdoc.pdf

http://www.bioconductor.org/workshops/Milan/PDF/Lab12.pdf


Best wishes. 

Saludos,
 
Juan Carlos Martínez Ovando
Banco de México
Av. 5 de Mayo No. 18
Piso 5 Sección D
Col. Centro
06059  México, D. F.
Tel. +52 55 52.37.20.00 ext. 3594
Fax. +52 55 52.37.27.03
e-mail: [EMAIL PROTECTED]
 

-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 12 de Octubre de 2004 04:08 PM
Para: [EMAIL PROTECTED]
Asunto: [R] covariate selection?

Hello,

I am hoping someone can help me with the following multivariate issue:  
I have a model consisting of about 50 covariates.  I would like to 
reduce this to about 5 covariate for the reduced model by combining 
cofactors that are strongly correlated.  Is there a package or function 
that would help me with this in R?  I appreciate any suggestions.

Thanks,
Ian

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


  


__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] covariate selection?

2004-10-12 Thread Christian Mora
Hi Ian
Have you tried help.search(pca)?
Christian

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Martínez Ovando
Juan Carlos
Sent: Tuesday, October 12, 2004 7:56 PM
To: Ian Fiske
Cc: [EMAIL PROTECTED]
Subject: RE: [R] covariate selection?


Hello Ian,

Sorry. I don't really understand your problem, which is of model
selection. That's right? 

You could use some criteria based in likelihood. For instante Akaike
(AIC) or Schwarz criteria (BIC), see: 

?AIC

?mle.aic

(The best model is determined minimizing AIC or BIC).

I hope this help you.

Greetings,
 
Juan Carlos 

-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 12 de Octubre de 2004 05:17 PM
Para: Martínez Ovando Juan Carlos
CC: [EMAIL PROTECTED]
Asunto: Re: [R] covariate selection?

Thanks Juan.  I thought that was what I was looking for, but really, I 
want to know which of the original covariates could best be used to take

advantage of their colinearity without creating new variables.  I think 
PCA creates new variables.  SAS and SPSS can do what I'm talking about, 
but I would like to use R for this.

Thanks,
Ian



Martínez Ovando Juan Carlos wrote:

Hello Ian,

?princomp

If your covariates are scalars, and the following documents:

http://www.jstatsoft.org/v07/i01/drdoc.pdf

http://www.bioconductor.org/workshops/Milan/PDF/Lab12.pdf


Best wishes.

Saludos,
 
Juan Carlos Martínez Ovando
Banco de México
Av. 5 de Mayo No. 18
Piso 5 Sección D
Col. Centro
06059  México, D. F.
Tel. +52 55 52.37.20.00 ext. 3594
Fax. +52 55 52.37.27.03
e-mail: [EMAIL PROTECTED]
 

-Mensaje original-
De: Ian Fiske [mailto:[EMAIL PROTECTED]
Enviado el: Martes, 12 de Octubre de 2004 04:08 PM
Para: [EMAIL PROTECTED]
Asunto: [R] covariate selection?

Hello,

I am hoping someone can help me with the following multivariate issue:
I have a model consisting of about 50 covariates.  I would like to 
reduce this to about 5 covariate for the reduced model by combining 
cofactors that are strongly correlated.  Is there a package or function

that would help me with this in R?  I appreciate any suggestions.

Thanks,
Ian

__
[EMAIL PROTECTED] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html


  


__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] covariate selection?

2004-10-12 Thread Kjetil Brinchmann Halvorsen
Ian Fiske wrote:
Hello,
I am hoping someone can help me with the following multivariate 
issue:  I have a model consisting of about 50 covariates.  I would 
like to reduce this to about 5 covariate for the reduced model by 
combining cofactors that are strongly correlated.  Is there a package 
or function that would help me with this in R?  I appreciate any 
suggestions.

Thanks,
Ian
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html


have a look at package leaps, and also consider ridge regression.
--
Kjetil Halvorsen.
Peace is the most effective weapon of mass construction.
  --  Mahdi Elmandjra
__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] covariate selection?

2004-10-12 Thread Austin, Matt
I like Kjetil's suggestion of a shrinkage estimator.  Perhaps this would be
a good time to experiment with Trevor Hastie's 'lars' package.

If you have a lot of correlated inputs I might suggest using Andy Liaw's
randomforest package.  I have found this technique to be very valuable in
this setting.  The partial dependency plots are a good way to explore the
functional relationships of the variables.

--Matt

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Kjetil Brinchmann
Halvorsen
Sent: Tuesday, October 12, 2004 17:16 PM
To: Ian Fiske
Cc: [EMAIL PROTECTED]
Subject: Re: [R] covariate selection?


Ian Fiske wrote:

 Hello,

 I am hoping someone can help me with the following multivariate 
 issue:  I have a model consisting of about 50 covariates.  I would 
 like to reduce this to about 5 covariate for the reduced model by 
 combining cofactors that are strongly correlated.  Is there a package 
 or function that would help me with this in R?  I appreciate any 
 suggestions.

 Thanks,
 Ian

 __
 [EMAIL PROTECTED] mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! 
 http://www.R-project.org/posting-guide.html


have a look at package leaps, and also consider ridge regression.

-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
   --  Mahdi Elmandjra

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] covariate selection in cox model (counting process)

2004-07-28 Thread Thomas Lumley
On Wed, 28 Jul 2004, Mayeul KAUFFMANN wrote:

 No, I mean recurrent events.  With counting process notation but no
 recurrent revents the partial likelihood is still valid, and the approach
 of treating it as a real likelihood for AIC (and presumably BIC) makes
 sense.
 
 Roughly speaking, you can't tell there is dependence until you see
 multiple events.

 Thanks a lot, I got it (well, I hope so)!


 I've read in several places that events in the Andersen-Gill model must be
 conditionnaly independent, which is sometimes more precisely written as
 conditionnaly independent given the covariates

 or even more precisely:

 the Andersen-Gill (AG) model assumes that each [individual] has a
 multi-event counting process with independent increments. The observed
 increments must be conditionally independent given the history of all
 observable information up to the event times.
 (http://www.stat.umu.se/egna/danardono/licdd.pdf)

More precisely still, for the criterion function in coxph() to be a
partial likelihood the estimating function must be a martingale. This
is actually a slightly weaker assumption than independent increments.

The proportional rates model doesn't require this assumption, and is also
sometimes called the Andersen-Gill model.  The criterion function isn't a
likelihood but it still gives valid estimators.


 Then, there is still another option. In fact, I already modelled
 explicitely the influence of past events with a proximity of last event
 covariate, assuming the dependence on the last event decreases at a
 constant rate (for instance, the proximity covariate varies from 1 to 0.5
 in the first 10 years after an event, then from 0.5 to 0.25 in the next
 ten years, etc).

 With a well chosen modelisation of the dependence effect, the events
 become conditionnaly independent, I do not need a +cluster(id) term, and I
 can use fit$loglik to make a covariate selection based on BIC, right?

If you can get the conditional independence (martingaleness) then, yes,
BIC is fine.

One way to check might be to see how similar the standard errors are with
and without the cluster(id) term.

-thomas


 Thanks a lot again for your time.

 Mayeul KAUFFMANN
 Univ. Pierre Mendes France
 Grenoble - France

 PS: I wrongly concluded from the R statement (Note: the likelihood ratio
 and score tests assume independence of observations within a cluster, the
 Wald and robust score tests do not).  that it meant independence between
 two consecutive observations (without any event). It made sense to  me
 because when only one covariate changes for a given individual, and with a
 small change, there is a new observation, with a risk very simlar to the
 risk for the previous observation. But there is still independence with
 respect to the question of recurrent event. Maybe the warning should be
 rewritten saying assume *conditionnal* independence of *events* (given
 the covariates)

 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] covariate selection in cox model (counting process)

2004-07-28 Thread Mayeul KAUFFMANN
 If you can get the conditional independence (martingaleness) then, yes,
 BIC is fine.

 One way to check might be to see how similar the standard errors are
with
 and without the cluster(id) term.

(Thank you again !, Thomas.)

At first look, the values seemed very similar (see below, case 2).
However, to check this without being too subjective, and without a
specific test, I needed other values to assess the size of the
differences: what is similar, what is not?


==
=
CASE 1
I first estimated the model without modeling dependence:

Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
pop + pib + pib2 + crois + instab.x1  + instab.autres, data = xstep)


 coef exp(coef) se(coef) robust se z   p
pop0.3606 1.434   0.09780.1182  3.05 2.3e-03
pib   -0.5947 0.552   0.19520.1828 -3.25 1.1e-03
pib2  -0.4104 0.663   0.14520.1270 -3.23 1.2e-03
crois -0.0592 0.943   0.02450.0240 -2.46 1.4e-02
instab.x1  2.2059 9.079   0.46920.4097  5.38 7.3e-08
instab.autres  0.9550 2.599   0.47000.4936  1.93 5.3e-02

Likelihood ratio test=74  on 6 df, p=6.2e-14  n= 7286

There seems to be a strong linear relationship between standard errors
(se, or naive se) and robust se.

  summary(lm(sqrt(diag(cox1$var))~ sqrt(diag(cox1$naive.var)) -1))
Coefficients:
   Estimate Std. Error t value Pr(|t|)
sqrt(diag(cox1$naive.var))  0.961030.04064   23.65 2.52e-06 ***
Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9894


==
=
CASE 2

Then I added a variable (pxcw) measuring the proximity of the previous
event (1pxcw0)

n= 7286
 coef exp(coef) se(coef) robust se z   p
pxcw   0.9063 2.475   0.42670.4349  2.08 3.7e-02
pop0.3001 1.350   0.10410.1295  2.32 2.0e-02
pib   -0.5485 0.578   0.20140.1799 -3.05 2.3e-03
pib2  -0.4033 0.668   0.14500.1152 -3.50 4.6e-04
crois -0.0541 0.947   0.02360.0227 -2.38 1.7e-02
instab.x1  1.9649 7.134   0.48390.4753  4.13 3.6e-05
instab.autres  0.8498 2.339   0.46930.4594  1.85 6.4e-02

Likelihood ratio test=78.3  on 7 df, p=3.04e-14  n= 7286


   Estimate Std. Error t value Pr(|t|)
sqrt(diag(cox1$naive.var))  0.983970.02199   44.74 8.35e-09 ***
Multiple R-Squared: 0.997, Adjusted R-squared: 0.9965

The naive standard errors (se) seem closer to the robust se than they were
when not modeling for dependence.
0.98397 is very close to one, R^2 grew, etc.
The dependence is high (risk is multiplied by 2.475 the day after an
event)
but conditional independence (given covariates) seems hard to reject.


==
=
CASE 3
Finally, I compared these results with those without repeated events
(which gives a smaller dataset). A country is removed as soon as we
observe its first event.
(robust se is still computed, even if naive se should in fact be used here
to compute the pvalue)

coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
pop + pib + pib2 + crois + instab.x1  + instab.autres, data =
xstep[no.previous.event, ])

 coef exp(coef) se(coef) robust se z   p
pop0.4236 1.528   0.10300.1157  3.66 2.5e-04
pib   -0.7821 0.457   0.20720.1931 -4.05 5.1e-05
pib2  -0.3069 0.736   0.14770.1254 -2.45 1.4e-02
crois -0.0432 0.958   0.02810.0258 -1.67 9.5e-02
instab.x1  1.9925 7.334   0.53210.3578  5.57 2.6e-08
instab.autres  1.3571 3.885   0.54280.5623  2.41 1.6e-02

Likelihood ratio test=66.7  on 6 df, p=1.99e-12  n=5971 (2466 observations
deleted due to missing)


  summary(lm(sqrt(diag(cox1$var))~ sqrt(diag(cox1$naive.var)) -1))
   Estimate Std. Error t value Pr(|t|)
sqrt(diag(cox1$naive.var))  0.866820.07826   11.08 0.000104 ***
Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608, Adjusted R-squared: 0.953


There seems to be no evidence that robust se is more different from se in
case 2 than in case 3 (and case 1).
It even seems closer.

I conclude that conditional independence (martingaleness) cannot be
rejected in CASE 2, when modeling the dependence between events with a
covariate.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



  Then, there is still another option. In fact, I already modelled
  explicitely the influence of past events with a proximity of last
event
  covariate, assuming the dependence on the last event decreases at a
  constant rate (for instance, the proximity covariate varies from 1 to
0.5
  in the first 10 years after an event, then from 0.5 to 0.25 in the
next
  ten 

Re: [R] covariate selection in cox model (counting process)

2004-07-28 Thread Thomas Lumley
On Wed, 28 Jul 2004, Mayeul KAUFFMANN wrote:

  If you can get the conditional independence (martingaleness) then, yes,
  BIC is fine.
 
  One way to check might be to see how similar the standard errors are
 with
  and without the cluster(id) term.

 (Thank you again !, Thomas.)

 At first look, the values seemed very similar (see below, case 2).
 However, to check this without being too subjective, and without a
 specific test, I needed other values to assess the size of the
 differences: what is similar, what is not?


I think the econometricians have theory for this (comparing the whole
covariance matrices).

-thomas


 ==
 =
 CASE 1
 I first estimated the model without modeling dependence:

 Call:
 coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
 pop + pib + pib2 + crois + instab.x1  + instab.autres, data = xstep)


  coef exp(coef) se(coef) robust se z   p
 pop0.3606 1.434   0.09780.1182  3.05 2.3e-03
 pib   -0.5947 0.552   0.19520.1828 -3.25 1.1e-03
 pib2  -0.4104 0.663   0.14520.1270 -3.23 1.2e-03
 crois -0.0592 0.943   0.02450.0240 -2.46 1.4e-02
 instab.x1  2.2059 9.079   0.46920.4097  5.38 7.3e-08
 instab.autres  0.9550 2.599   0.47000.4936  1.93 5.3e-02

 Likelihood ratio test=74  on 6 df, p=6.2e-14  n= 7286

 There seems to be a strong linear relationship between standard errors
 (se, or naive se) and robust se.

   summary(lm(sqrt(diag(cox1$var))~ sqrt(diag(cox1$naive.var)) -1))
 Coefficients:
Estimate Std. Error t value Pr(|t|)
 sqrt(diag(cox1$naive.var))  0.961030.04064   23.65 2.52e-06 ***
 Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9894


 ==
 =
 CASE 2

 Then I added a variable (pxcw) measuring the proximity of the previous
 event (1pxcw0)

 n= 7286
  coef exp(coef) se(coef) robust se z   p
 pxcw   0.9063 2.475   0.42670.4349  2.08 3.7e-02
 pop0.3001 1.350   0.10410.1295  2.32 2.0e-02
 pib   -0.5485 0.578   0.20140.1799 -3.05 2.3e-03
 pib2  -0.4033 0.668   0.14500.1152 -3.50 4.6e-04
 crois -0.0541 0.947   0.02360.0227 -2.38 1.7e-02
 instab.x1  1.9649 7.134   0.48390.4753  4.13 3.6e-05
 instab.autres  0.8498 2.339   0.46930.4594  1.85 6.4e-02

 Likelihood ratio test=78.3  on 7 df, p=3.04e-14  n= 7286


Estimate Std. Error t value Pr(|t|)
 sqrt(diag(cox1$naive.var))  0.983970.02199   44.74 8.35e-09 ***
 Multiple R-Squared: 0.997, Adjusted R-squared: 0.9965

 The naive standard errors (se) seem closer to the robust se than they were
 when not modeling for dependence.
 0.98397 is very close to one, R^2 grew, etc.
 The dependence is high (risk is multiplied by 2.475 the day after an
 event)
 but conditional independence (given covariates) seems hard to reject.


 ==
 =
 CASE 3
 Finally, I compared these results with those without repeated events
 (which gives a smaller dataset). A country is removed as soon as we
 observe its first event.
 (robust se is still computed, even if naive se should in fact be used here
 to compute the pvalue)

 coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
 pop + pib + pib2 + crois + instab.x1  + instab.autres, data =
 xstep[no.previous.event, ])

  coef exp(coef) se(coef) robust se z   p
 pop0.4236 1.528   0.10300.1157  3.66 2.5e-04
 pib   -0.7821 0.457   0.20720.1931 -4.05 5.1e-05
 pib2  -0.3069 0.736   0.14770.1254 -2.45 1.4e-02
 crois -0.0432 0.958   0.02810.0258 -1.67 9.5e-02
 instab.x1  1.9925 7.334   0.53210.3578  5.57 2.6e-08
 instab.autres  1.3571 3.885   0.54280.5623  2.41 1.6e-02

 Likelihood ratio test=66.7  on 6 df, p=1.99e-12  n=5971 (2466 observations
 deleted due to missing)


   summary(lm(sqrt(diag(cox1$var))~ sqrt(diag(cox1$naive.var)) -1))
Estimate Std. Error t value Pr(|t|)
 sqrt(diag(cox1$naive.var))  0.866820.07826   11.08 0.000104 ***
 Residual standard error: 0.06328 on 5 degrees of freedom
 Multiple R-Squared: 0.9608, Adjusted R-squared: 0.953


 There seems to be no evidence that robust se is more different from se in
 case 2 than in case 3 (and case 1).
 It even seems closer.

 I conclude that conditional independence (martingaleness) cannot be
 rejected in CASE 2, when modeling the dependence between events with a
 covariate.

 Mayeul KAUFFMANN
 Univ. Pierre Mendes France
 Grenoble - France



   Then, there is still another option. In fact, I already modelled
   explicitely the influence of past events with a 

Re: [R] covariate selection in cox model (counting process)

2004-07-27 Thread Thomas Lumley
On Tue, 27 Jul 2004, Mayeul KAUFFMANN wrote:

 Thank you a lot for your time and your answer, Thomas. Like all good
 answers, it raised new questions for me ;-)

 In the case of recurrent events coxph() is not
  using maximum likelihood or even maximum partial likelihood. It is
  maximising the quantity that (roughly speaking) would be the partial
  likelihood if the covariates explained all the cluster differences.

 I could have non repeating events by removing countries once they have
 experienced a war. But I'm not sure it will change the estimation
 procedure since this will change the dataset only, not the formula
 coxph(Surv(start,stop,status)~x1+x2+...+cluster(id),robust=T)

 I am not sure I understood you well: do you really mean recurrent events
 alone or any counting process notation (including allowing for recurrent
 events).

No, I mean recurrent events.  With counting process notation but no
recurrent revents the partial likelihood is still valid, and the approach
of treating it as a real likelihood for AIC (and presumably BIC) makes
sense.

Roughly speaking, you can't tell there is dependence until you see
multiple events.


 I thought the counting process notation did not differ really from the Cox
 model in R, since Terry M. Therneau (A Package for Survival Analysis in S,
 April 22, 1996) concludes his mathematical section 3.3 Cox Model by The
 above notation is derived from the counting process representation [...]
 It allows very naturally for several extensions to the original Cox model
 formulation: multiple events per subject, discontinuous intervals of risk
 [...],left truncation. (I used it to introduce 1. time-dependent
 covariates, some covariates changing yearly, other irregularly, and 2.
 left truncation: not all countries existed at the beginning of the study)


 In the case of recurrent events coxph() is not
  using maximum likelihood or even maximum partial likelihood.

 Then, what does fit$loglik give in this case? Still a likelihood or a
 valid criterion to maximise ?
 If not, how to get (manually) the criterion that was maximsed?

fit$loglik gives the criterion that was maximised.  This is the function
of the data that *would be* the partial likelihood if there was no
within-country dependence.

This is a convenient criterion function because it is easy to maximise,
and it is known to give valid (and reasonably efficient) estimates for
what you might call a proportional rates model in the case of recurrent
events.  However, it no longer has the same claim to be a real
likelihood that the Cox partial likelihood does, because it is not
modelling the dependence.


 That's of interest for me since
  I created artificial covariates measuring the proximity since some
 events: exp(-days.since.event/a.chosen.parameter).

 ...and I used fit$loglik to chose a.chosen.parameter from 8 values, for 3
 types of events:

That's fine -- within a single model maximising the criterion function is
valid.  The problem is that you can not assume either that
differences between nested models have a chisquared distribution nor that
the expected change in loglik is the same as the number of parameters.

This means that you don't have any absolute scale for choosing
penalties, which is a problem in model selection -- it is hard to balance the
increase in fit$loglik with the increase in model complexity.

In principle you could use cross-validation to estimate the
cost-complexity tradeoff in these models, but this requires the ability to
compute the criterion function on a subset not included in the model,
which is not entirely straightforward.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] covariate selection in cox model (counting process)

2004-07-27 Thread Mayeul KAUFFMANN
No, I mean recurrent events.  With counting process notation but no
recurrent revents the partial likelihood is still valid, and the approach
of treating it as a real likelihood for AIC (and presumably BIC) makes
sense.

Roughly speaking, you can't tell there is dependence until you see
multiple events.

Thanks a lot, I got it (well, I hope so)!


I've read in several places that events in the Andersen-Gill model must be
conditionnaly independent, which is sometimes more precisely written as
conditionnaly independent given the covariates

or even more precisely:

the Andersen-Gill (AG) model assumes that each [individual] has a
multi-event counting process with independent increments. The observed
increments must be conditionally independent given the history of all
observable information up to the event times.
(http://www.stat.umu.se/egna/danardono/licdd.pdf)


Then, there is still another option. In fact, I already modelled
explicitely the influence of past events with a proximity of last event
covariate, assuming the dependence on the last event decreases at a
constant rate (for instance, the proximity covariate varies from 1 to 0.5
in the first 10 years after an event, then from 0.5 to 0.25 in the next
ten years, etc).

With a well chosen modelisation of the dependence effect, the events
become conditionnaly independent, I do not need a +cluster(id) term, and I
can use fit$loglik to make a covariate selection based on BIC, right?

Thanks a lot again for your time.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

PS: I wrongly concluded from the R statement (Note: the likelihood ratio
and score tests assume independence of observations within a cluster, the
Wald and robust score tests do not).  that it meant independence between
two consecutive observations (without any event). It made sense to  me
because when only one covariate changes for a given individual, and with a
small change, there is a new observation, with a risk very simlar to the
risk for the previous observation. But there is still independence with
respect to the question of recurrent event. Maybe the warning should be
rewritten saying assume *conditionnal* independence of *events* (given
the covariates)

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] covariate selection in cox model (counting process)

2004-07-26 Thread Mayeul KAUFFMANN
Hello everyone,
I am searching for a covariate selection procedure in a cox model
formulated
as a counting process.
I use intervals, my formula looks like coxph(Surv(start,stop,status)~
x1+x2+...+cluster(id),robust=T) where id is a country code (I study
occurence of civil wars from 1962 to 1997).
I'd like something not based on p-values, since they have several flaws
for
this purpose.
I turned to other criteria but all the articles I read seems to apply to
the
classical formulation of the cox model, not the counting process one (or
they apply to both but I am not aware of this)

I've tried AIC with
 step(cox.fit)
or
 stepAIC(cox.fit)

and BIC using
step(cox.fit,k = log(n))
but there seems to be 2 theoretical problems to address:

(1) These values are based on partial loglikelihood (loglik)
I wonder if this is correct with the cox model formulated as a *counting
process*, with many (consecutive) observations for a given

individual, and then some observation not being independent

Since the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests

do not (R warning), and the likelihood ratio being based on loglik, can
I
use loglik in BIC with some dependent observations?

[I have 170 individuals (namely, countries) for 36 year, some single
countries having up to 140 very short observation intervals, other
having
(on

the other extreme) only 1 long interval per year. That's because I
created
artificial covariates measuring the proximity since some

events: exp(-days.since.event/a.chosen.parameter). I splitted every
yearly
interval for which these covariates change rapidly (i.e.

when the events are recent) yielding up to 11 intervals a year]

(2) What penalized term to used?

It seems natural to include the number of covariates, k.
What about the number of observations?

I found several definitions:
AIC= -2 loglik(b) +  2.k
Schwartz Bayesian information criteria:
SBIC= -2 loglik(b) +  k ln(n)

Frédérique Letué (author of PhD thesis COX MODEL: ESTIMATION VIA MODEL
SELECTION AND BIVARIATE SHOCK MODEL,
http://www-lmc.imag.fr/lmc-sms/Frederique.Letue/These3.ps)
suggested me
AIC= - loglik(b) +  2.k/n
BIC= - loglik(b) +  2.ln(n).k/n, with other possible values for
parameter
2 in this case (see her thesis p.100, but this section is in French)

All these do not tell *what to take for n*. There are 3 main
possibilities:

a) Taking the number of observations (including censored one) will give
a
huge n (around 6000 to 8000), which may seem meaningless

since some observations are only a few days long.
With n at the denominator (Letué's criteria), the penalized term would
be so
low that it's like not having it:
log(7000)/7000
[1] 0.001264809

(where loglik from summary(cox.fit) range from -155 to -175, dependig on
the
model)

b) Volinsky  Raftery propose a revision of the penalty term in BIC so
that
the penalty is defined in terms of the number of

uncensored events instead of the number of observations. (Volinsky 
Raftery , Bayesian Information Criterion for Censored

Survival Models, June 16, 1999,
http://www.research.att.com/~volinsky/papers/biocs.ps)

This could be computed with
sum(coxph.detail(cox.interac.dsi6mois)$nevent)

Letué's BIC penalized trerm with 50 events will then be
 2*log(50)/50
[1] 0.1564809
which will have more effects.

However, adding or removing a country which has data for the 36 years
but no
event (then, it is censored) will not change this BIC.

Thus, it is not suitable to account for missing data that do not reduce
the
number of event.
I'd like the criteria to take this into account, because all covariates
do
not have the same missing data.
The question is: When I have the choice with adding a covariate, x10 or
x11,
which have different (not nested) set of missing

values, which one is best?
Estimating all subsets of the full model (full model = all covariates)
with
a dataset containing no missing data for the full model

would be a solution but would more than halve the dataset for many
subsets
of the covariates.

I should mention that step(cox.fit) gives a warning and stops:
Error in step(cox.fit) : number of rows in use has changed: remove
missing
values?

which makes me ask whether the whole procedure is OK with model of
different
sample size.


c) For discrete time event history analysis, the same choice has been
made,
while the total number of exposure time units has also

been used, for consistency with logistic regresion
(Raftery,Lewis,Aghajanian
and Kahn,1993;Raftery Lewisand Aghajanian,1994)

(Raftery, Bayesian Model Selection in Social Research, 1994,
http://www.stat.washington.edu/tech.reports/bic.ps)

I am not sure what exposure time units mean. But since I could have
used a
logit model with yearly observations [but with many

flaws...], I suggest I could use the number of years (sum of length of
intervals, in year)

 sum((fit$y)[,2]-(fit$y)[,1])/365.25
[1] 3759.537

This may still be too high.

Since I have datas 

Re: [R] covariate selection in cox model (counting process)

2004-07-26 Thread Thomas Lumley
On Mon, 26 Jul 2004, Mayeul KAUFFMANN wrote:

 Hello everyone,
 I am searching for a covariate selection procedure in a cox model
 formulated
 as a counting process.
 I use intervals, my formula looks like coxph(Surv(start,stop,status)~
 x1+x2+...+cluster(id),robust=T) where id is a country code (I study
 occurence of civil wars from 1962 to 1997).
 I'd like something not based on p-values, since they have several flaws
 for
 this purpose.

You may be out of luck.  In the case of recurrent events coxph() is not
using maximum likelihood or even maximum partial likelihood. It is
maximising the quantity that (roughly speaking) would be the partial
likelihood if the covariates explained all the cluster differences.

Partial likelihood for single events does have an AIC analogue that works
reasonably well (not surprisingly, since the partial likelihood is also a
perfectly valid marginal likelihood for the ranks of the survival times).
For recurrent events this isn't going to work.

If you absolutely have to do covariate selection you may need to look for
a maximum likelihood approach, such as a parametric model with random
effects to describe the dependence.  You might be able to use survreg()
with frailty() terms.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] covariate selection in cox model (counting process)

2004-07-26 Thread Mayeul KAUFFMANN
Thank you a lot for your time and your answer, Thomas. Like all good
answers, it raised new questions for me ;-)

In the case of recurrent events coxph() is not
 using maximum likelihood or even maximum partial likelihood. It is
 maximising the quantity that (roughly speaking) would be the partial
 likelihood if the covariates explained all the cluster differences.

I could have non repeating events by removing countries once they have
experienced a war. But I'm not sure it will change the estimation
procedure since this will change the dataset only, not the formula
coxph(Surv(start,stop,status)~x1+x2+...+cluster(id),robust=T)

I am not sure I understood you well: do you really mean recurrent events
alone or any counting process notation (including allowing for recurrent
events).


I thought the counting process notation did not differ really from the Cox
model in R, since Terry M. Therneau (A Package for Survival Analysis in S,
April 22, 1996) concludes his mathematical section 3.3 Cox Model by The
above notation is derived from the counting process representation [...]
It allows very naturally for several extensions to the original Cox model
formulation: multiple events per subject, discontinuous intervals of risk
[...],left truncation. (I used it to introduce 1. time-dependent
covariates, some covariates changing yearly, other irregularly, and 2.
left truncation: not all countries existed at the beginning of the study)


In the case of recurrent events coxph() is not
 using maximum likelihood or even maximum partial likelihood.

Then, what does fit$loglik give in this case? Still a likelihood or a
valid criterion to maximise ?
If not, how to get (manually) the criterion that was maximsed?

That's of interest for me since
 I created artificial covariates measuring the proximity since some
events: exp(-days.since.event/a.chosen.parameter).

...and I used fit$loglik to chose a.chosen.parameter from 8 values, for 3
types of events:


la-c(263.5, 526.9,1053.9,2107.8,4215.6,8431.1) #list of values to choose
from
z-NULL;for(a1 in la) for(a2 in la) for(a3 in la) {coxtmp -
(coxph(Surv(start,stop,status)~
+I(exp(-days.since.event.of.type.one/a1))
+I(exp(-days.since.event.of.type.two/a2))
+I(exp(-days.since.event.of.type.three/a3))
+ other.time.dependent.covariates
+cluster(id)
,data=x,robust=T))
rbind(z,c(a1,a2,a3,coxtmp$wald.test, coxtmp$rscore, coxtmp$loglik,
coxtmp$score))-z
}
z - data.frame(z)
names(z) - c(a1,a2, a3,wald.test, rscore,
NULLloglik,loglik, score)
z[which.max(z$rscore),]
z[which.max(z$loglik),]

The last two commands gave me almost always the same set for c(a1,a2,a3).
But they sometimes differed significantly on some models.

Which criteria (if any ?!) should I use to select the best set c(a1,a2,a3)
?


(If you wish to see what the proximity variables look like, run the
following code. The dashed lines show the half life of the proximity
variable,here=6 months, which is determined by a.chosen.parameter, e.g.
a1=la[1]:
#start of code
curve(exp(-(x)/263.5),0,8*365.25,xlab=number of days since last political
regime change (dsrc),ylab=Proximity of political regime change =
exp(-dsrc/263.5),las=1)
axis(1,at=365.25/2, labels= (6 months));axis(2,at=seq(0,1,.1),las=1)
lines(c(365.25/2,365.25/2,-110),c(-.05,0.5,0.5),lty=dashed)
#end of code)



Thanks a lot again.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html