[R] Cox PH Warning Message

2004-10-19 Thread Mayeul KAUFFMANN
Sorry for the late answer.

This is the output. I think it means that there is a problem with
project$pluralgpTriplet???
Neil

Yes, definitely. There are probably no triplets, or all triplets are also coded as
twins (and there are no real twins). Or the only triplet is the first
individual that dies (in fact, the latter should rather give an infinite
exp(coef), I think).
You also seem to use many qualitative variables. If a combination of two of
the other variables is identical to pluralgpTriplet, R drops the
variable. (This is true for any regression model.)
To find out which combination it is, you may regress pluralgpTriplet on all
the other variables with lm(). You will probably get R²=1 and some very low
p-values.
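A minimal sketch of that check (assuming the covariates live in the data frame
project, as in the fitted model below; the 0/1 indicator for the Triplet level is
built by hand here, and only a few of the other covariates are shown):

triplet <- as.numeric(project$pluralgp == "Triplet")
chk <- lm(triplet ~ yrborn + private + sex + totadm, data = project)  # add the remaining covariates of the model
summary(chk)$r.squared   # a value of (almost) 1 reveals the redundant combination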

If the number of triplets is very small, I advise grouping them with the
twins, unless it is a major variable in your model.



Cheers,
Mayeul KAUFFMANN
Université Pierre Mendès France
Grenoble - France

  fm.CPH
Call:
coxph(formula = Surv(age_at_death, death) ~ project$pluralgp +
    project$yrborn + project$private + project$pcdead + project$mheight +
    project$ga2 + project$apgcode + project$apgar5 + project$VLBWgp +
    project$PEBW + project$LBWgp + project$LBWcat + project$totadm +
    project$totlos + project$sex + teen + project$DS + project$pcsborn +
    project$disadv + project$ecores + project$edoccu)


                               coef exp(coef) se(coef)       z       p
project$pluralgpTwin      -0.509203     0.601  0.74415 -0.6843 4.9e-01
project$pluralgpTriplet          NA        NA  0.0          NA      NA
project$yrborn1988-1992   -0.083348     0.920  0.22553 -0.3696 7.1e-01
project$private            0.260465     1.298  0.12384  2.1032 3.5e-02
project$pcdead            -0.058200     0.943  0.49418 -0.1178 9.1e-01
project$mheight            0.016361     1.016  0.01745  0.9378 3.5e-01
project$ga2                0.104246     1.110  0.10162  1.0258 3.0e-01
project$apgcode4-7        -0.266885     0.766  0.50211 -0.5315 6.0e-01
project$apgcode1-3        -1.704620     0.182  1.12545 -1.5146 1.3e-01
project$apgar5            -0.427139     0.652  0.14099 -3.0296 2.4e-03
project$VLBWgp=1500 grams  0.046203     1.047  0.65494  0.0705 9.4e-01
project$PEBW               0.015297     1.015  0.01356  1.1281 2.6e-01
project$LBWgp=2500 grams  -0.257472     0.773  0.42496 -0.6059 5.4e-01
project$LBWcat            -0.222823     0.800  0.23938 -0.9308 3.5e-01
project$totadm             0.005836     1.006  0.01536  0.3800 7.0e-01
project$totlos             0.007342     1.007  0.00176  4.1664 3.1e-05
project$sexFemale         -0.016431     0.984  0.22803 -0.0721 9.4e-01
teen                       0.286282     1.331  0.34807  0.8225 4.1e-01
project$DSDown syndrome    1.193727     3.299  0.27227  4.3844 1.2e-05
project$pcsborn           -0.704743     0.494  0.75943 -0.9280 3.5e-01
project$disadv            -0.000573     0.999  0.00250 -0.2296 8.2e-01
project$ecores            -0.001588     0.998  0.00249 -0.6392 5.2e-01
project$edoccu             0.000590     1.001  0.00193  0.3054 7.6e-01



[R] sapply and loop

2004-10-16 Thread Mayeul KAUFFMANN
Zhen,

how to count the exact time ?

See ?system.time (package base): "Returns CPU (and other) times that expr used."


If you only need seconds, you can also do

date(); zz <- sapply(ma, myfunction); date()
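To time the call itself with sub-second resolution, system.time() can wrap the same
expression (ma and myfunction being the objects from your question):

system.time(zz <- sapply(ma, myfunction))   # the third value is the elapsed time in seconds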

I do not know how to reduce the time.
For very complex iterations, I use for() loops myself, which may be inefficient.

Mayeul KAUFFMANN
Université Pierre Mendès France
Grenoble
France



[R] Cox PH Warning Message

2004-10-16 Thread Mayeul KAUFFMANN
Hi,
Can anybody tell me what the message below means and how to overcome it.
Thanks,
Neil
Warning message:
X matrix deemed to be singular; variable 2 in: coxph(Surv(age_at_death,
death) ~ project$pluralgp + project$yrborn +  .

Your 2nd covariate (yrborn) is collinear with the other covariates (or nearly
collinear). R drops it, but warns you. The results are OK.
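If in doubt about which term was actually dropped, the NA coefficient shows it; a small
sketch, using fm.CPH, the fitted model shown in the fuller output above:

names(coef(fm.CPH))[is.na(coef(fm.CPH))]   # names of the singular (dropped) terms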

Mayeul KAUFFMANN
Université Pierre Mendès France
Grenoble
France



[R] constrained splines in GLM

2004-10-15 Thread Mayeul KAUFFMANN
Hi,
I would like to use constrained splines in a GLM (Poisson family)
to take into account the nonlinear effect of some covariates. The
constraints I need are described below.
I have several variables that I need concurrently in the same model.

I looked at package mgcv but I do not know if/how I can use it in a GLM (not a
GAM): I could not manage to adapt the mono.con() (mgcv) example to a GLM.
The help for package fda is not complete.
I am not sure that backSpline() (splines) does what I need.
isoreg() (modreg) seems to do only univariate regressions.


Some of my covariates are linear.

Three covariates (x1, x2 and x3) must be transformed in a decreasing and
convex way, like this:

|o
| o
|  o
|    o
|       o
|            o
|------------------

Currently, I use exp(-x1/alpha1) + exp(-x2/alpha2) + exp(-x3/alpha3); I try
several alphas and choose the best according to the log-likelihood.
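For reference, a sketch of that parametric approach inside a Poisson GLM; y, z1, z2
(the linear covariates), the data frame dat and the alpha values are placeholder
names, not from the original model:

alpha1 <- 1; alpha2 <- 1; alpha3 <- 1   # candidate values, varied over a grid
fit <- glm(y ~ I(exp(-x1/alpha1)) + I(exp(-x2/alpha2)) + I(exp(-x3/alpha3)) + z1 + z2,
           family = poisson, data = dat)
logLik(fit)   # compared across the candidate alphas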


One variable should have only one local maximum (that is, the derivative
should be zero only once, which is at the top), like this:
|
|  TOP
|  oo
|  o  o
|o  o
|o o o
| o o
|

with bs() or ns() and no constraint, I get:
|
|  TOP
|  oo
|o  o o
|  o o  o
|o
| o o
|
which is nonsense (note there are very few observations on the left part)


I also tried some parametric forms, choosing via log-likelihood. But with
four covariates, that is a lot of parameters to try (several hours, even with
functions that are not very flexible).



I am looking for something similar to ns() or bs() (package splines), which are
very convenient to place in the formula of a GLM. I tried them, adjusting the
knots, but could not achieve what I want. Constraints on some derivatives may do
the trick, but I do not know how to implement them in R.
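For comparison, this is the kind of unconstrained term that is so convenient to drop
into the formula (again with placeholder names x4, z1, z2 and dat):

library(splines)
glm(y ~ ns(x4, df = 4) + z1 + z2, family = poisson, data = dat)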

Any help or comment would be greatly appreciated !

Mayeul KAUFFMANN
Université Pierre Mendès France - Grenoble
France



Re: [R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?

2004-08-13 Thread Mayeul KAUFFMANN

 you can always use parametric models and a
 full likelihood (but you may have to program them yourself).
 Prof Brian Ripley

I started trying this, but I could not make the counting process notation
work with it (Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for
counting processes: a large sample study. Ann. Statist. 10, 1100-1120).
I think that notation is currently only available for the Cox model in R.

survreg(Surv(start, stop, status) ~ x1, data = data)
Error in survreg(Surv(start, stop, status) ~ x1, data = data) :
 Invalid survival type



[R] how to plot an array with labels

2004-08-13 Thread Mayeul KAUFFMANN
How can I plot an array and, instead of having the array indexes as x-axis
labels, display another string array of the same length?

Do this:
plot(myarray, xaxt = "n", xlab = "")
axis(1, at = 1:length(myarray), labels = my.vector.of.names)
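For example, with made-up data:

myarray <- c(3, 1, 4, 1, 5)
my.vector.of.names <- c("a", "b", "c", "d", "e")
plot(myarray, xaxt = "n", xlab = "")
axis(1, at = 1:length(myarray), labels = my.vector.of.names)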



Re: [R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?

2004-08-13 Thread Mayeul KAUFFMANN
First, I do not know how to specify such a link function in R.
Second, if I can specify such a link, I could use (in place of d*j) the
smooth baseline estimated after doing a Cox regression. But I don't know
how to fit (for instance) a piecewise constant baseline hazard with a
Poisson glm.

I found a possible answer to one of my own questions, sorry for the
nuisance.
Simply adding
+ factor(year)
to the Poisson model will fit a piecewise constant baseline hazard (one
step every year), or
+ cut(year, breaks = seq(1961, 1997, by = 2))
to have one step every two years.
(Hope the rest is correct...)
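A sketch of what such a Poisson fit could look like (event, year, x1, x2 and dat are
hypothetical names, not taken from the original model):

glm(event ~ x1 + x2 + factor(year), family = poisson, data = dat)
# or, for one step every two years:
glm(event ~ x1 + x2 + cut(year, breaks = seq(1961, 1997, by = 2)), family = poisson, data = dat)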


Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France


- Original Message - 
From: Prof Brian Ripley [EMAIL PROTECTED]
To: Mayeul KAUFFMANN [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, August 13, 2004 8:40 AM
Subject: Re: [R] How to use the whole dataset (including between events)
in Cox model (time-varying covariates) ?


 This is the consequence of the use of partial likelihood in the Cox
model.
 You should read the literature on this point (for example, have you read
 Cox, 1972 and all its discussion, or Andersen, Borgan, Gill & Keiding?).

 It is not an R question.  You need to make more assumptions, such as a
 smooth baseline hazard, and you can always use parametric models and a
 full likelihood (but you may have to program them yourself).


 On Fri, 13 Aug 2004, Mayeul KAUFFMANN wrote:

  Hello,
 
  coxph does not use any information in the dataset between event times (or
  death times), since computation only occurs at event times.
 
  For instance, removing observations when there is no event at that time in
  the whole dataset does not change the results:
   set.seed(1)
   data <- as.data.frame(cbind(start=c(1:5,1:5,1:4), stop=c(2:6,2:6,2:5),
     status=c(rep(0,7),1,rep(0,5),1), id=c(rep(1,5),rep(2,5),rep(3,4)), x1=rnorm(14)))
   data
     start stop status id         x1
  1      1    2      0  1 -0.6264538
  2      2    3      0  1  0.1836433
  3      3    4      0  1 -0.8356286
  4      4    5      0  1  1.5952808
  5      5    6      0  1  0.3295078
  6      1    2      0  2 -0.8204684
  7      2    3      0  2  0.4874291
  8      3    4      1  2  0.7383247
  9      4    5      0  2  0.5757814
  10     5    6      0  2 -0.3053884
  11     1    2      0  3  1.5117812
  12     2    3      0  3  0.3898432
  13     3    4      0  3 -0.6212406
  14     4    5      1  3 -2.2146999
  coxph(Surv(start, stop, status) ~ cluster(id) + x1, data = data, robust = TRUE)
  coxph(Surv(start, stop, status) ~ cluster(id) + x1,
        data = subset(data, stop %in% 4:5), robust = TRUE)  # the same!!! (except n)
 
  First, some data is lost.
  Second, this loss could be an important problem when there is a time-varying
  covariate that changes faster than the frequency of events.
  Specifically, I have a covariate which has low values most of the time. It
  sometimes jumps to high values, and that is hypothesized to greatly increase
  the risk of an event.
  With rare events, the effect of this covariate will only be measured at event
  times. Chances are that the only time such a covariate is recorded at a high
  level, the individual for which it is high is having an event.
  This may bias the estimated coefficient.
 
  Here is my question:
  How to fully use the dataset?
 
  (that is: how to have really _time-varying_ covariates (even if they change
  step by step, not continuously), not covariates whose changes are measured
  only at event times)
 
  Ideally, the full dataset would be used to estimate the parameters, or at
  least to estimate the standard error of the estimated parameters.
  Any ideas ???
  .
  .
  .
 
  A second best (which might require less work) would be to use the whole
  dataset to assess the predictive power of the model.
 
  Maybe by using the expected number of events for an individual over the
  time interval that they were observed to be at risk
   predict(coxfit, type = "expected")
  and compare it with the observed number of events
  (does it use all the data and take into account the whole baseline hazard,
  even between events?)
 
 
  Or, if not, following Brian D. Ripley's suggestion about the baseline hazard:
  "As an approximation you can smooth the fitted baseline cumulative hazard
  (e.g. by package pspline) and ask for its derivative"
  (https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html)
 
  the following code could be used to estimate (and plot) a smooth baseline
  hazard:
   t <- seq(min(data$start), max(data$stop), length = 100)
   lines(t, predict(sm.spline(x = basehaz(coxfit)[, 2], y = basehaz(coxfit)[, 1], norder = 2), t, 1))
  # there is a problem with this code: one should add the constraint that the
  # baseline hazard cannot be negative.
 
  The following computes the parametric part of the Cox model:
   risk <- predict(coxfit, type = 'risk')  # gives exp(X'b)
 
  something like
   baseline.hazard * risk
  would give the true risk at any time (but it would probably be much harder
  to compute),
 
  which could help assess the predictive power of the model.
  (still a lot of work)
 
  Thanks in advance for any help or comment.



Re: [R] coxph question

2004-08-13 Thread Mayeul KAUFFMANN
Chris,
I understood the following: you want to try every single covariate in a
Cox model with only one covariate, then take the best ones according to the
p-value.

Assume your columns  look like:
stop status event x1 x2 x3 etc
You want to add column 3 (x1), then 4, etc.

I suggest a for() loop:

z <- NULL
for (i in 3:ncol(data)) {
  coxtmp <- coxph(Surv(stop, status) ~ data[, i], data = data)
  # you can modify the formula to add covariates in any case, for instance
  beta <- unname(coxtmp$coefficients[1])
  se   <- unname(sqrt(diag(coxtmp$var))[1])
  z <- rbind(z, c(i = i, beta = beta, se = se,
                  pvalue = signif(1 - pchisq((beta/se)^2, 1), 4)))
  print(i)
}
z <- as.data.frame(z)

# then select the covariates according to the p-values:
z[z$pvalue < 0.01, ]

Hope it helps.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France


- Original Message - 
Thanks Mayeul,

I actually would like to test each variable individually and use those that
have a low p-value to build a classifier (not in a Cox model). Therefore, I
need to write a function to subset those low p-value variables, instead of
putting them all in as covariates. Any ideas?

Chris

-Original Message-

 I have many variables to test using cox model (coxph), and I am only
interested in those variables with p value less than 0.01. Is there a
quick way to do this automatically instead of looking at the output of
each variable?
 Chris



[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?

2004-08-12 Thread Mayeul KAUFFMANN
Hello,

coxph does not use any information in the dataset between event times (or
death times), since computation only occurs at event times.

For instance, removing observations when there is no event at that time in
the whole dataset does not change the results:
 set.seed(1)
 data <- as.data.frame(cbind(start=c(1:5,1:5,1:4), stop=c(2:6,2:6,2:5),
   status=c(rep(0,7),1,rep(0,5),1), id=c(rep(1,5),rep(2,5),rep(3,4)), x1=rnorm(14)))
 data
   start stop status id         x1
1      1    2      0  1 -0.6264538
2      2    3      0  1  0.1836433
3      3    4      0  1 -0.8356286
4      4    5      0  1  1.5952808
5      5    6      0  1  0.3295078
6      1    2      0  2 -0.8204684
7      2    3      0  2  0.4874291
8      3    4      1  2  0.7383247
9      4    5      0  2  0.5757814
10     5    6      0  2 -0.3053884
11     1    2      0  3  1.5117812
12     2    3      0  3  0.3898432
13     3    4      0  3 -0.6212406
14     4    5      1  3 -2.2146999
coxph(Surv(start, stop, status) ~ cluster(id) + x1, data = data, robust = TRUE)
coxph(Surv(start, stop, status) ~ cluster(id) + x1,
      data = subset(data, stop %in% 4:5), robust = TRUE)  # the same!!! (except n)

First, some data is lost.
Second, this loss could be an important problem when there is a time-varying
covariate that changes faster than the frequency of events.
Specifically, I have a covariate which has low values most of the time. It
sometimes jumps to high values, and that is hypothesized to greatly increase
the risk of an event.
With rare events, the effect of this covariate will only be measured at event
times. Chances are that the only time such a covariate is recorded at a high
level, the individual for which it is high is having an event.
This may bias the estimated coefficient.

Here is my question:
How to fully use the dataset?

(that is: how to have really _time-varying_ covariates (even if they change
step by step, not continuously), not covariates whose changes are measured
only at event times)

Ideally, the full dataset would be used to estimate the parameters, or at
least to estimate the standard error of the estimated parameters.
Any ideas ???
.
.
.

A second best (which might require less work) would be to use the whole
dataset to assess the predictive power of the model.

Maybe by using the expected number of events for an individual over the
time interval that they were observed to be at risk
 predict(coxfit, type = "expected")
and compare it with the observed number of events
(does it use all the data and take into account the whole baseline hazard,
even between events?)


Or, if not, following Brian D. Ripley's suggestion about the baseline hazard:
"As an approximation you can smooth the fitted baseline cumulative hazard
(e.g. by package pspline) and ask for its derivative"
(https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html)

the following code could be used to estimate (and plot) a smooth baseline
hazard:
 t <- seq(min(data$start), max(data$stop), length = 100)
 lines(t, predict(sm.spline(x = basehaz(coxfit)[, 2], y = basehaz(coxfit)[, 1], norder = 2), t, 1))
# there is a problem with this code: one should add the constraint that the
# baseline hazard cannot be negative.

The following computes the parametric part of the Cox model:
 risk <- predict(coxfit, type = 'risk')  # gives exp(X'b)

something like
 baseline.hazard * risk
would give the true risk at any time (but it would probably be much harder
to compute),

which could help assess the predictive power of the model.
(still a lot of work)

Thanks in advance for any help or comment.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



[R] coxph question

2004-08-12 Thread Mayeul KAUFFMANN
 I have many variables to test using cox model (coxph), and I am only
interested in those variables with p value less than 0.01. Is there a
quick way to do this automatically instead of looking at the output of
each variable?
 Chris

I guess you need covariate selection.
For a lengthy discussion of another method (AIC/BIC), look at last month's
archive:
https://www.stat.math.ethz.ch/pipermail/r-help/2004-July/subject.html#53519

and try using
 library(MASS)
 stepAIC (...)

Most of the time, it starts by removing the covariate with the highest p-value.
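A sketch (coxfit is a hypothetical fitted coxph object and dat the data it was fitted
on; the default penalty k = 2 gives AIC, k = log(n) a BIC-like criterion):

library(MASS)
coxfit.aic <- stepAIC(coxfit)
coxfit.bic <- stepAIC(coxfit, k = log(nrow(dat)))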


Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



[R] proximity covariate to model dependence in cox model (question from Vito Muggeo)

2004-07-29 Thread Mayeul KAUFFMANN
(I received a message from Vito Muggeo [who sometimes posts here] about
my previous posts here. Thus, it might be of interest here.)

 dear Mayeul,
 I read your post on the R-list about survival analysis with multiple (two,
 if I am not wrong) events per subject.

Sometimes subjects have even 3 events (i.e. civil wars over the last 40
years).


  1) How did you calculate the pxcw variable?
(Note: pxcw stands for ProXimity.of.last.Civil.War.)

the answer is at
https://www.stat.math.ethz.ch/pipermail/r-help/2004-July/053556.html
after the words "I used fit$loglik to choose a.chosen.parameter from 8
values, for 3 types of events":

I used exp(-days.since.event.of.type.one/a1), which gives an indicator
between 1 and 0 that falls quickly at first and then more slowly (see the end
of the above post, 053556.html).

Simply use only event.of.type.one. I am not doing a competing risks
analysis: I am only counting one type of event (type.one), but the
covariates measure other events that may increase the risk.

The difficulty is to choose a good a1 parameter. You will typically need
values much lower than mine with your example dataset.

It then becomes:

la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)  # list of values to choose from; adapt it
z <- NULL
for (a1 in la) {
  coxtmp <- coxph(Surv(start, stop, status) ~
      I(exp(-days.since.event.of.type.one/a1))
      + other.time.dependent.covariates
      + cluster(id),
      data = x, robust = TRUE)
  z <- rbind(z, c(a1, coxtmp$wald.test, coxtmp$rscore, coxtmp$loglik, coxtmp$score))
}
z <- data.frame(z)
names(z) <- c("a1", "wald.test", "rscore", "NULLloglik", "loglik", "score")
z[which.max(z$rscore), ]
z[which.max(z$loglik), ]


 Namely, for instance, given the following data set (standard in the multiple
 event formulation):
  data.frame(id=c(3,3,4,4,5), start=c(0,10,0,7,0), stop=c(10,15,7,20,9),
             event=c(1,1,1,0,0), stratum=c(1,2,1,2,1))
   id start stop event stratum
 1  3     0   10     1       1
 2  3    10   15     1       2
 3  4     0    7     1       1
 4  4     7   20     0       2
 5  5     0    9     0       1
 how is the pxcw variable computed for each *row* in the dataset?
 Should it be something like ifelse(stratum==1, 0, stop-start), i.e.
 (0, 5, 0, 13, 0), or am I wrong?

It looks computationally OK, but strange to me. You can choose the function
that best matches the dependence, so maybe it is fine.
But, assuming the risk jumps just after an event and then decreases slowly, I
would rather use a very high value when there is no previous event, for
instance:
ifelse(stratum == 1, 100, stop - start)
Otherwise, a very close event (yesterday) would be coded nearly like no
previous event!
Or even better, truncate the decrease:
ifelse(stratum == 1, 100, ifelse(stop - start > 100, 100, stop - start))
Here, after 100 days, it is like having no previous event.
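Equivalently, the truncated version can be written with pmin() (proxim is just a name
for the resulting covariate):

proxim <- ifelse(stratum == 1, 100, pmin(stop - start, 100))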

But you should divide your observations to account for the change in the
proximity variable.
Instead of:
 id start stop event stratum proxim
  4     0    7     1       1    100
  4     7   20     0       2     13

use

 id start stop event stratum proxim
  4     0    7     1       1    100
  4     7    8     0       2      1
  4     8    9     0       2      2
  .
  .
  4    19   20     0       2     13

With the function I used, the covariate (days.since.event) was coded 999 if
there was no previous event; I need:
exp(-days.since.event/a1)



 Is this your idea or did you find it elsewhere? Could you give me any
 reference?

The proximity covariate is from
http://www.worldbank.org/research/conflict/papers/CivilPeace.pdf

(But they do not use the counting process we use here. They only measure
covariates for all countries when one country experiences an event.
Thus, all the rest is mine, I think.)



 Hope you can help me,
 regards,
 vito




[R] covariate selection in cox model (counting process)

2004-07-28 Thread Mayeul KAUFFMANN
 If you can get the conditional independence (martingaleness) then, yes,
 BIC is fine.

 One way to check might be to see how similar the standard errors are
with
 and without the cluster(id) term.

(Thank you again, Thomas!)

At first look, the values seemed very similar (see below, case 2).
However, to check this without being too subjective, and without a
specific test, I needed other values to assess the size of the
differences: what is similar, what is not?


=========================================================================
CASE 1
I first estimated the model without modeling dependence:

Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
    pop + pib + pib2 + crois + instab.x1 + instab.autres, data = xstep)

                 coef exp(coef) se(coef) robust se     z       p
pop            0.3606     1.434   0.0978    0.1182  3.05 2.3e-03
pib           -0.5947     0.552   0.1952    0.1828 -3.25 1.1e-03
pib2          -0.4104     0.663   0.1452    0.1270 -3.23 1.2e-03
crois         -0.0592     0.943   0.0245    0.0240 -2.46 1.4e-02
instab.x1      2.2059     9.079   0.4692    0.4097  5.38 7.3e-08
instab.autres  0.9550     2.599   0.4700    0.4936  1.93 5.3e-02

Likelihood ratio test=74  on 6 df, p=6.2e-14  n= 7286

There seems to be a strong linear relationship between standard errors
(se, or naive se) and robust se.

  summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.96103    0.04064   23.65 2.52e-06 ***
Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9894


=========================================================================
CASE 2

Then I added a variable (pxcw) measuring the proximity of the previous
event (0 < pxcw < 1).

n= 7286
                 coef exp(coef) se(coef) robust se     z       p
pxcw           0.9063     2.475   0.4267    0.4349  2.08 3.7e-02
pop            0.3001     1.350   0.1041    0.1295  2.32 2.0e-02
pib           -0.5485     0.578   0.2014    0.1799 -3.05 2.3e-03
pib2          -0.4033     0.668   0.1450    0.1152 -3.50 4.6e-04
crois         -0.0541     0.947   0.0236    0.0227 -2.38 1.7e-02
instab.x1      1.9649     7.134   0.4839    0.4753  4.13 3.6e-05
instab.autres  0.8498     2.339   0.4693    0.4594  1.85 6.4e-02

Likelihood ratio test=78.3  on 7 df, p=3.04e-14  n= 7286


                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.98397    0.02199   44.74 8.35e-09 ***
Multiple R-Squared: 0.997, Adjusted R-squared: 0.9965

The naive standard errors (se) seem closer to the robust se than they were
when not modeling for dependence.
0.98397 is very close to one, R^2 grew, etc.
The dependence is high (risk is multiplied by 2.475 the day after an
event)
but conditional independence (given covariates) seems hard to reject.


=========================================================================
CASE 3
Finally, I compared these results with those without repeated events
(which gives a smaller dataset). A country is removed as soon as we
observe its first event.
(robust se is still computed, even if naive se should in fact be used here
to compute the pvalue)

coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
    pop + pib + pib2 + crois + instab.x1 + instab.autres,
    data = xstep[no.previous.event, ])

                 coef exp(coef) se(coef) robust se     z       p
pop            0.4236     1.528   0.1030    0.1157  3.66 2.5e-04
pib           -0.7821     0.457   0.2072    0.1931 -4.05 5.1e-05
pib2          -0.3069     0.736   0.1477    0.1254 -2.45 1.4e-02
crois         -0.0432     0.958   0.0281    0.0258 -1.67 9.5e-02
instab.x1      1.9925     7.334   0.5321    0.3578  5.57 2.6e-08
instab.autres  1.3571     3.885   0.5428    0.5623  2.41 1.6e-02

Likelihood ratio test=66.7  on 6 df, p=1.99e-12  n=5971 (2466 observations deleted due to missing)


  summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.86682    0.07826   11.08 0.000104 ***
Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608, Adjusted R-squared: 0.953


There seems to be no evidence that robust se is more different from se in
case 2 than in case 3 (and case 1).
It even seems closer.

I conclude that conditional independence (martingaleness) cannot be
rejected in CASE 2, when modeling the dependence between events with a
covariate.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



  Then, there is still another option. In fact, I already modelled
  explicitly the influence of past events with a "proximity of last event"
  covariate, assuming the dependence on the last event decreases at a
  constant rate (for instance, the proximity covariate varies from 1 to 0.5
  in the first 10 years after an event, then from 0.5 to 0.25 in the next
  ten years

[R] BIC vz SBIC vz SIC

2004-07-27 Thread Mayeul KAUFFMANN
 2) question: always on BIC, from the stepAIC() function help page I found a
 k=log(n) argument to add. Since that produces an error, is there a way to
 find the n dynamically?

stepAIC(mydata.logistic, trace = F, k=log(nrow(mydata)))

-- 
Daniele Medri

(That was 3 weeks ago, but I myself was just faced with the same question.)

Just a little warning: it is not really done dynamically. stepAIC takes
the part of mydata that was used to fit mydata.logistic, but
k=log(nrow(mydata)) will still use the entire set.
Thus, if there are some missing data in mydata, I suggest you use
something like:
k=log(sum(complete.cases(mydata)))
which will be smaller than k=log(nrow(mydata)).
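That is, combining it with the call above:

stepAIC(mydata.logistic, trace = FALSE, k = log(sum(complete.cases(mydata))))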

If some data are missing only for a covariate and this one is withdrawn,
you will have a warning and it will stop:

Error in stepAIC(mydata.logistic (etc.)
 number of rows in use has changed: remove missing values?

This is presumably because these criteria need to compare two models fitted
to the same dataset.

In any other case, if you have NAs at the start but the number of NAs does
not change in the process, you won't be warned if you use
k=log(nrow(mydata)).

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



[R] Reading and treating multiple files....

2004-07-27 Thread Mayeul KAUFFMANN
I will get - in a few days - a set of separate files (one for each record;
~80,000 files per year) named key.txt, in which other info (measurements of
pressure and velocity at each second) is stored; typically each of these
separate files will hold something between 200 and 1500 records. I'm supposed
to do statistical analysis on this dataset, on the yearly information.
Well, as you can suspect, my problem is thus:
- batch processing of these individual files: reading each file
(automatically!),

I suggest you put all the files in a single directory, then use list.files()
or dir() to make a list of these files:
myfiles <- list.files("c:\\mydir")   # on Windows
Then you do a loop over these files with read.table().
You can create a column named "file" containing the name of the file.

All of this can be done with:

data <- NULL
for (i in dir("c:\\mydir")) data <-
  rbind(data, cbind(file = i, read.table(paste("c:\\mydir\\", i, sep = ""))))
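An equivalent sketch with lapply(), under the same assumptions (all files in
c:\\mydir, identical columns in every file):

files <- list.files("c:\\mydir", full.names = TRUE)
tables <- lapply(files, read.table)
data <- do.call(rbind, Map(cbind, file = basename(files), tables))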

Hope that helps

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



[R] covariate selection in cox model (counting process)

2004-07-27 Thread Mayeul KAUFFMANN
No, I mean recurrent events.  With counting process notation but no
recurrent events the partial likelihood is still valid, and the approach
of treating it as a real likelihood for AIC (and presumably BIC) makes
sense.

Roughly speaking, you can't tell there is dependence until you see
multiple events.

Thanks a lot, I got it (well, I hope so)!


I've read in several places that events in the Andersen-Gill model must be
conditionally independent, which is sometimes more precisely written as
"conditionally independent given the covariates",

or even more precisely:

"the Andersen-Gill (AG) model assumes that each [individual] has a
multi-event counting process with independent increments. The observed
increments must be conditionally independent given the history of all
observable information up to the event times."
(http://www.stat.umu.se/egna/danardono/licdd.pdf)


Then, there is still another option. In fact, I already modelled
explicitly the influence of past events with a "proximity of last event"
covariate, assuming the dependence on the last event decreases at a
constant rate (for instance, the proximity covariate varies from 1 to 0.5
in the first 10 years after an event, then from 0.5 to 0.25 in the next
ten years, etc.).

With a well-chosen modelling of the dependence effect, the events become
conditionally independent, I do not need a +cluster(id) term, and I can
use fit$loglik to do a covariate selection based on BIC, right?

Thanks a lot again for your time.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

PS: I wrongly concluded from the R statement "(Note: the likelihood ratio
and score tests assume independence of observations within a cluster, the
Wald and robust score tests do not)" that it meant independence between
two consecutive observations (without any event). It made sense to me
because when only one covariate changes for a given individual, and with a
small change, there is a new observation, with a risk very similar to the
risk for the previous observation. But there is still independence with
respect to the question of recurrent events. Maybe the warning should be
rewritten to say "assume *conditional* independence of *events* (given
the covariates)".



[R] covariate selection in cox model (counting process)

2004-07-26 Thread Mayeul KAUFFMANN
 From 1962 to 1997 (36 years), I have the following number of
complete-cases equivalents:
sum((fit$y)[,2] - (fit$y)[,1]) / (365.25*36)
[1] 104.4316
This seems more reasonable and would account for NAs in different models.

However, it might be too high, because some countries are not at risk during
the whole period: some did not exist because they gained independence near
the end of the period (e.g. ex-USSR countries in the early 1990s), or because
they were experiencing an event (new wars in countries already experiencing a
war are not taken into account).

I may take the *proportion* of available data to time at risk to adjust for
this: a country at risk during 1 year and for which data are available for
this entire year will increase n by 1, not by 1/36.
If the data frame dataset contains all countries at risk (including some
with NAs), and assuming id == id[complete.cases(dataset)] (all countries have
at least one complete observation), this will be computed by

attach(dataset)
sum(tapply(stop[complete.cases(dataset)] - start[complete.cases(dataset)],
           CCODE[complete.cases(dataset)], sum) /
    tapply(stop - start, CCODE, sum))

But this would be a rather empirical adjustment, maybe with no theoretical
basis.
And I don't think I can enter this as the argument k to step().


Thank you for having read this. Hope I was not too long.
And thank you a lot for any help, comment, etc.

(sorry for mistakes in English as I'm a non native English speaker)

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France




[R] covariate selection in cox model (counting process)

2004-07-26 Thread Mayeul KAUFFMANN
Thank you a lot for your time and your answer, Thomas. Like all good
answers, it raised new questions for me ;-)

In the case of recurrent events coxph() is not
 using maximum likelihood or even maximum partial likelihood. It is
 maximising the quantity that (roughly speaking) would be the partial
 likelihood if the covariates explained all the cluster differences.

I could have non-repeating events by removing countries once they have
experienced a war. But I'm not sure it will change the estimation
procedure, since this will change the dataset only, not the formula
coxph(Surv(start, stop, status) ~ x1 + x2 + ... + cluster(id), robust = TRUE)

I am not sure I understood you well: do you really mean recurrent events
alone, or any counting process notation (including notation that merely
allows for recurrent events)?


I thought the counting process notation did not really differ from the Cox
model in R, since Terry M. Therneau (A Package for Survival Analysis in S,
April 22, 1996) concludes his mathematical section 3.3 "Cox Model" with "The
above notation is derived from the counting process representation [...]
It allows very naturally for several extensions to the original Cox model
formulation: multiple events per subject, discontinuous intervals of risk
[...], left truncation." (I used it to introduce 1. time-dependent
covariates, some covariates changing yearly, others irregularly, and 2.
left truncation: not all countries existed at the beginning of the study.)


In the case of recurrent events coxph() is not
 using maximum likelihood or even maximum partial likelihood.

Then, what does fit$loglik give in this case? Still a likelihood, or a
valid criterion to maximise?
If not, how can I get (manually) the criterion that was maximised?

That's of interest to me since
 "I created artificial covariates measuring the proximity of some
events: exp(-days.since.event/a.chosen.parameter)."

...and I used fit$loglik to choose a.chosen.parameter from 8 values, for 3
types of events:


la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)  # list of values to choose from
z <- NULL
for (a1 in la) for (a2 in la) for (a3 in la) {
  coxtmp <- coxph(Surv(start, stop, status) ~
      I(exp(-days.since.event.of.type.one/a1))
      + I(exp(-days.since.event.of.type.two/a2))
      + I(exp(-days.since.event.of.type.three/a3))
      + other.time.dependent.covariates
      + cluster(id),
      data = x, robust = TRUE)
  z <- rbind(z, c(a1, a2, a3, coxtmp$wald.test, coxtmp$rscore, coxtmp$loglik, coxtmp$score))
}
z <- data.frame(z)
names(z) <- c("a1", "a2", "a3", "wald.test", "rscore", "NULLloglik", "loglik", "score")
z[which.max(z$rscore), ]
z[which.max(z$loglik), ]

The last two commands almost always gave me the same set for c(a1,a2,a3).
But they sometimes differed significantly for some models.

Which criterion (if any?!) should I use to select the best set c(a1,a2,a3)?


(If you wish to see what the proximity variables look like, run the
following code. The dashed lines show the half-life of the proximity
variable, here 6 months, which is determined by a.chosen.parameter, e.g.
a1=la[1]:
# start of code
curve(exp(-x/263.5), 0, 8*365.25,
      xlab = "number of days since last political regime change (dsrc)",
      ylab = "Proximity of political regime change = exp(-dsrc/263.5)", las = 1)
axis(1, at = 365.25/2, labels = "(6 months)"); axis(2, at = seq(0, 1, .1), las = 1)
lines(c(365.25/2, 365.25/2, -110), c(-.05, 0.5, 0.5), lty = "dashed")
# end of code)



Thanks a lot again.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France



[R] smooth non cumulative baseline hazard in Cox model

2004-07-04 Thread Mayeul KAUFFMANN
Hi everyone.
There have been several threads on the baseline hazard in the Cox model, but I
think they were all about the cumulative baseline hazard; for instance:
http://tolstoy.newcastle.edu.au/R/help/01a/0464.html
http://tolstoy.newcastle.edu.au/R/help/01a/0436.html


basehaz in package survival seems to compute a cumulative hazard.

Extract from the basehaz function:
sfit <- survfit(fit)
H <- -log(sfit$surv)

Since sfit$surv is monotonic, H will be monotonic too, which makes me think
it is a cumulative function. I think H(t) is the sum (integral) of the
lambdas from 0 to t (am I right?).
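(If so, H(t) is the integral of lambda(u) du from 0 to t, so the instantaneous hazard
lambda(t) is the derivative of H; smoothing H and differentiating it, as in the
follow-up reply below, is one way to estimate it.)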

What I need might be lambda(t) or lambda(t)dt (I do not know for sure),
something involving the instantaneous baseline risk.
But, for sure, I've seen elsewhere what I need.
Specifically, here:
http://econ.worldbank.org/files/13214_CivilPeace.pdf
that is page 41 of HEGRE, Håvard; ELLINGSEN, Tanja; GATES, Scott; GLEDITSCH,
Nils Petter (2001). Toward a Democratic Civil Peace? Democracy, Political
Change, and Civil War, 1816-1992. American Political Science Review, vol.
95, no 1.

I'm doing the same job as Hegre et al. (studying civil wars), but with the
counting process formulation of the Cox model (I use intervals; my formula
looks like Surv(start, stop, status) ~ etc.).
Like Hegre et al. (who use the Stata software), I would like to have a
curve showing the (instantaneous) overall risk of civil war at a given
time, taking away the effect of the covariates.

For those who also use the SAS software (which I do not, unfortunately), the
job I need done seems to be performed by the SMOOTH macro described in
"Survival Analysis Using the SAS System: A Practical Guide" by Paul D.
Allison (see below).

There is a graph (Output 5.22, "smoothed hazard functions for two financial
aid groups", p. 170 of his book) which shows another example (except that I
only need it for one group, at mean values).



I hope I am clear enough. Thank you a lot for any help.
(sorry for mistakes in English, as I'm a non-native English speaker)

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France


SMOOTH Macro  (SAS)



%macro smooth (data=_last_, time=, width=, survival=survival);
/*
MACRO SMOOTH produces graphs of smoothed hazard functions using output
from either PROC LIFETEST or PROC PHREG. With PROC LIFETEST, it uses the
data set produced by the OUTSURV option in the PROC statement. With PROC
PHREG, it uses the data set produced by the BASELINE statement. SMOOTH
employs a kernel smoothing method described by H. Ramlau-Hansen (1983),
"Smoothing Counting Process Intensities by Means of Kernel Functions",
The Annals of Statistics 11, 453-466. If there is more than one survival
curve in the input data set, SMOOTH will produce multiple smoothed
hazard curves on the same axes.
There are four parameters:
DATA is the name of the data set containing survivor function
estimates. The default is the most recently created data set.
TIME is the name of the variable containing event times.
SURVIVAL is the name of a variable containing survivor function
estimates (the default is SURVIVAL, which is the automatic name in
PROC LIFETEST).
WIDTH is the bandwidth of the smoothing function. The default is 1/5 of
the range of event times.
Example of usage:
%smooth(data=my.data,time=duration,width=8,survival=s)
Author: Paul D. Allison, University of Pennsylvania
[EMAIL PROTECTED]
*/
data _inset_;
set &data end=final;
retain _grp_ _censor_ 0;
t=&time;
survival=&survival;
if t=0 and survival=1 then _grp_=_grp_+1;
keep _grp_ t survival;
if final and _grp_ > 1 then call symput('nset','yes');
else if final then call symput('nset','no');
if _censor_ = 1 then delete;
if survival in (0,1) then delete;
run;
proc iml;
use _inset_;
read all var {t _grp_};
%if &width ne %then %let w2=&width;
%else %let w2=(max(t)-min(t))/5;
w=&w2;
z=char(w,8,2);
call symput('width',z);
numset=max(_grp_);
create _plt_ var{ lambda s group};
setin _inset_ ;
do m=1 to numset;
read all var {t survival _grp_} where (_grp_=m);
n=nrow(survival);
lo=t[1] + w;
hi=t[n] - w;
npt=50;
inc=(hi-lo)/npt;
s=lo+(1:npt)`*inc;
group=j(npt,1,m);
slag=1//survival[1:n-1];
h=1-survival/slag;
x = (j(npt,1,1)*t` - s*j(1,n,1))/w;
k=.75*(1-x#x)#(abs(x)<=1);
lambda=k*h/w;
append;
end;
quit;
%if &nset = yes %then %let c==group;
%else %let c=;
proc gplot data=_plt_;
plot lambda*s &c / vaxis=axis1 vzero haxis=axis2;
axis1 label=(angle=90 f=titalic 'Hazard Function' ) minor=none ;
axis2 label=(f=titalic "Time (bandwidth=&width)") minor=none;
symbol1 i=join color=black line=1;
symbol2 i=join color=red line=2;
symbol3 i=join color=green line=3;
symbol4 i=join color=blue line=4;
run;
quit;
%mend smooth;


Re: [R] smooth non cumulative baseline hazard in Cox model

2004-07-04 Thread Mayeul KAUFFMANN
Thank you all for your quick answers.

With respect to my question on a smooth non-cumulative baseline Cox hazard, I
followed Prof Brian Ripley's advice and used the following:

library(survival)
plot(basehaz(coxfinal2)[, 2]/365.25 + 1945, basehaz(coxfinal2)[, 1], type = "l")
xx <- seq(min(basehaz(coxfinal2)[, 2]/365.25 + 1945),
          max(basehaz(coxfinal2)[, 2]/365.25 + 1945), length = 100)  # my start value was 1 January 1945
library(pspline)
lines(xx, predict(sm.spline(x = basehaz(coxfinal2)[, 2]/365.25 + 1945,
                            y = basehaz(coxfinal2)[, 1], norder = 2),
                  xarg = xx, nderiv = 1))

It might seem that computing the derivative when time is expressed in years
gives the annual probability of an event. The previous commands give a graph
exactly identical to:

plot(basehaz(coxfinal2)[, 2], basehaz(coxfinal2)[, 1], type = "l")
xx <- seq(min(basehaz(coxfinal2)[, 2]), max(basehaz(coxfinal2)[, 2]), length = 100)
lines(xx, 365.25 * predict(sm.spline(x = basehaz(coxfinal2)[, 2],
                                     y = basehaz(coxfinal2)[, 1], norder = 2),
                           xarg = xx, nderiv = 1))   # [second command]

However, if p is the probability of an event on the 1st day of a given year,
it is not obvious to me that the probability of one event during that year
equals 365*p. Am I mistaken? If not, what does the second command compute?

So if someone can help me say what the time unit is for the risk shown by
lines(xx, predict(sm.spline(x = basehaz(coxfinal2)[, 2]/365.25 + 1945,
                            y = basehaz(coxfinal2)[, 1], norder = 2),
                  xarg = xx, nderiv = 1))
...

@@@

With respect to censoring, I think we all agree:

Peter Dalgaard wrote:
 Prof Brian Ripley [EMAIL PROTECTED] writes:

   I'm doing the same job as Hegre et al. (studying civil wars) but with
the
   counting process formulation of the Cox model. (I use intervals, my
formula
   looks like Surv(start,stop,status)~  etc.).
 
  Careful, that is left- and right- censored, not intervals.  Surv has a
  type= argument.

 Nitpick: That's left-*truncated* and right-censored (the status refers
 to the condition at the right end, people who die before the start are
 not registered at all).

I use the following dataset:

id  start  stop  status  ... covariates
 1      1   365       0  ...
 1    365   400       1  ...  [the war starts at 400 and ends at 550]
 1    550   730       0  ...  [there are possibly repeated events, so the country re-enters the study]
 2      1   365       0  ...
 2    365   730       0  ...
etc.

where there is one id for every country, that is, several lines for each
country (each line thus representing an interval of time).

with
coxph(Surv(start, stop, status, type = "interval") ~ x1 + ... + cluster(id)

I did not mean interval censoring (although I think it is present here for
country 1 from time 400 to 550); I meant "interval" in the same sense as in
the R help for Surv:
  time2: ending time of the interval for interval censored or counting
  process data only. Intervals are assumed to be open on the left and closed
  on the right, (start, end]. For counting process data, event indicates
  whether an event occurred at the end of the interval.
"Surv has a type= argument." Yes, and the help says "The default is 'right'
or 'counting' depending on whether the time2 argument is absent or present,
respectively." Here, I omitted the type, which means I used a counting
process.

Thus, the union of all intervals for country 2 (here, lines 4 and 5) leads to
one big interval which is left-truncated and right-censored.

Anyway, I think there is no ambiguity, since if one tries type = "interval"
it says:
Error in coxph(Surv(start, stop, status, type = "interval") ~ ...
 Cox model doesn't support "interval" survival data

But thanks to Prof. Ripley for the comment, as I am not fully aware of the
exact terminology in English.

Regards,

Mayeul KAUFFMANN
