[R] Cox PH Warning Message
Sorry for the late answer. This is the output. I think it means that there is a problem with project$pluralgpTriplet??? Neil

Yes, definitely. There are probably no triplets, or all triplets are coded as twins too (and there are no real twins). Or the only triplet is the first individual that dies (in fact, the latter should rather give an infinite exp(coef), I think).

You also seem to use many qualitative variables. If a combination of two of the other variables is identical to pluralgpTriplet, R drops the variable. (This is true for any regression model.) To find out which combination, you may regress pluralgpTriplet on all the other variables with lm(). You will probably get R²=1 and some very low p-values.

If the number of triplets is very small, I advise you to group them with the twins, unless it is a major variable in your model.

Cheers,
Mayeul KAUFFMANN
Université Pierre Mendès France, Grenoble - France

fm.CPH
Call:
coxph(formula = Surv(age_at_death, death) ~ project$pluralgp +
    project$yrborn + project$private + project$pcdead + project$mheight +
    project$ga2 + project$apgcode + project$apgar5 + project$VLBWgp +
    project$PEBW + project$LBWgp + project$LBWcat + project$totadm +
    project$totlos + project$sex + teen + project$DS + project$pcsborn +
    project$disadv + project$ecores + project$edoccu)

                                coef exp(coef) se(coef)       z       p
project$pluralgpTwin       -0.509203     0.601  0.74415 -0.6843 4.9e-01
project$pluralgpTriplet           NA        NA  0.00000      NA      NA
project$yrborn1988-1992    -0.083348     0.920  0.22553 -0.3696 7.1e-01
project$private             0.260465     1.298  0.12384  2.1032 3.5e-02
project$pcdead             -0.058200     0.943  0.49418 -0.1178 9.1e-01
project$mheight             0.016361     1.016  0.01745  0.9378 3.5e-01
project$ga2                 0.104246     1.110  0.10162  1.0258 3.0e-01
project$apgcode4-7         -0.266885     0.766  0.50211 -0.5315 6.0e-01
project$apgcode1-3         -1.704620     0.182  1.12545 -1.5146 1.3e-01
project$apgar5             -0.427139     0.652  0.14099 -3.0296 2.4e-03
project$VLBWgp<1500 grams   0.046203     1.047  0.65494  0.0705 9.4e-01
project$PEBW                0.015297     1.015  0.01356  1.1281 2.6e-01
project$LBWgp<2500 grams   -0.257472     0.773  0.42496 -0.6059 5.4e-01
project$LBWcat             -0.222823     0.800  0.23938 -0.9308 3.5e-01
project$totadm              0.005836     1.006  0.01536  0.3800 7.0e-01
project$totlos              0.007342     1.007  0.00176  4.1664 3.1e-05
project$sexFemale          -0.016431     0.984  0.22803 -0.0721 9.4e-01
teen                        0.286282     1.331  0.34807  0.8225 4.1e-01
project$DSDown syndrome     1.193727     3.299  0.27227  4.3844 1.2e-05
project$pcsborn            -0.704743     0.494  0.75943 -0.9280 3.5e-01
project$disadv             -0.000573     0.999  0.00250 -0.2296 8.2e-01
project$ecores             -0.001588     0.998  0.00249 -0.6392 5.2e-01
project$edoccu              0.000590     1.001  0.00193  0.3054 7.6e-01

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
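The lm() check suggested above can be sketched on invented data (the variable names below are hypothetical, not the poster's): a dummy that is an exact combination of two others shows up with an R² of 1.

```r
# Toy illustration of the lm() trick suggested above (invented data, not the
# poster's): 'triplet' equals 'multiple' minus 'twin', an exact combination.
twin     <- c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1)
triplet  <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
multiple <- twin + triplet            # "any multiple birth" dummy

fit <- lm(triplet ~ twin + multiple)
summary(fit)$r.squared                # R^2 of (essentially) 1 reveals the redundancy
```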
[R] sapply and loop
Zhen,

"how to count the exact time?"

See ?system.time (package base): it returns the CPU (and other) times that expr used. If you only need seconds, you can also do:

date(); zz <- sapply(ma, myfunction); date()

I do not know how to reduce the time. For very complex iterations, I use for() myself, which may be inefficient.

Mayeul KAUFFMANN
Université Pierre Mendès France, Grenoble - France
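The two timing approaches above can be sketched as follows (the `ma` list and `myfunction` here are toy stand-ins for the original poster's objects):

```r
# Sketch of the two timing approaches above (toy 'ma' and 'myfunction').
myfunction <- function(x) sum(sqrt(abs(x)))
ma <- replicate(100, rnorm(1000), simplify = FALSE)

timing <- system.time(zz <- sapply(ma, myfunction))  # CPU and elapsed times
date(); zz <- sapply(ma, myfunction); date()         # cruder: wall-clock before/after
```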
[R] Cox PH Warning Message
Hi, can anybody tell me what the message below means and how to overcome it? Thanks, Neil

Warning message: X matrix deemed to be singular; variable 2 in: coxph(Surv(age_at_death, death) ~ project$pluralgp + project$yrborn + ...

Your 2nd covariate (yrborn) is collinear with the other covariates (or nearly collinear). R drops it, but warns you. The results are OK.

Mayeul KAUFFMANN
Université Pierre Mendès France, Grenoble - France
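The warning can be reproduced on a minimal invented example, assuming the survival package: a covariate that is an exact copy of another makes the X matrix singular, and coxph() drops it with an NA coefficient.

```r
# A minimal reproduction of that warning (assumes the 'survival' package):
# x2 is an exact copy of x1, so the X matrix is singular and coxph() drops it.
library(survival)

set.seed(2)
time   <- rexp(30)
status <- rbinom(30, 1, 0.7)
x1     <- rnorm(30)
x2     <- x1                                # perfectly collinear with x1

fit <- coxph(Surv(time, status) ~ x1 + x2)  # warns that X is singular
coef(fit)                                   # the x2 coefficient is NA
```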
[R] constrained splines in GLM
Hi,

I would like to use constrained splines in a GLM model (Poisson link) to take into account the nonlinear effect of some covariates. The constraints I need are described below. I have several variables that I need concurrently in the same model.

I looked at package mgcv but I do not know if/how I can use it in a GLM (not a GAM): I could not manage to adapt the mono.con (mgcv) example to a GLM. The help for package fda is not complete; I am not sure that backSpline (splines) does what I need. isoreg (modreg) seems to do univariate regressions only.

Some of my covariates are linear. Three covariates (x1, x2 and x3) must be transformed in a decreasing and convex way, like this:

|o
|o
| o
|  o
|    o
|      o
|         o
|------------

Currently, I use exp(-x1/alpha1) + exp(-x2/alpha2) + exp(-x3/alpha3); I try several alphas and choose the best according to the log-likelihood.

One variable should have only one local maximum (that is, the derivative should be zero only once, at the top), like this:

|     TOP
|     oo
|    o  o
|   o    o
|  o      o
| o        o  o
|o           o

With bs() or ns() and no constraint, I get:

|     TOP
|      oo
|o    o  o
| o  o    o  o
|  oo       oo
|

which is nonsense (note there are very few observations on the left part).

I also tried some parametric forms, choosing via log-likelihood. But with four covariates, it is a lot of parameters to try (several hours with even barely flexible functions).

I am looking for something similar to ns or bs (package splines), which are very convenient to place in the formula of a GLM model. I tried them, adjusting knots, but could not manage to get what I want. Constraints on some derivatives may do the trick, but I do not know how to implement them in R.

Any help or comment would be greatly appreciated!

Mayeul KAUFFMANN
Université Pierre Mendès France - Grenoble, France
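The alpha search described above (try several alphas for the decreasing convex transform exp(-x/alpha) and keep the best log-likelihood) can be sketched on simulated data; in the post this is done jointly for x1, x2 and x3, and the data below are invented.

```r
# Sketch of the alpha grid search described above, on toy Poisson data.
set.seed(3)
x1 <- runif(200, 0, 10)
y  <- rpois(200, lambda = exp(1 + 2 * exp(-x1 / 2)))   # simulated with alpha = 2

alphas <- c(0.5, 1, 2, 4, 8)
ll <- sapply(alphas, function(a)
  as.numeric(logLik(glm(y ~ I(exp(-x1 / a)), family = poisson))))
best.alpha <- alphas[which.max(ll)]   # the transform with the best log-likelihood
```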
Re: [R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?
"You can always use parametric models and a full likelihood (but you may have to program them yourself)." Prof Brian Ripley

I started trying this, but I could not make the counting process notation work with it (Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for counting processes: a large sample study. Ann. Stat. 10, 1100-1120). I think it is (currently) only available for the Cox model in R:

survreg(Surv(start, stop, status) ~ x1, data = data)
Error in survreg(Surv(start, stop, status) ~ x1, data = data) :
        Invalid survival type
[R] how to plot an array with labels
How can I plot an array and, instead of having the indexes of the array as x labels, display another string array of the same length?

Do this:

plot(myarray, xaxt = "n", xlab = "")
axis(1, at = 1:length(myarray), labels = my.vector.of.names)
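A runnable version of the recipe above, with an invented array and names:

```r
# Custom x-axis labels: suppress the default axis, then draw one by hand.
myarray <- c(3.1, 4.5, 2.2, 5.0)
my.vector.of.names <- c("north", "south", "east", "west")

plot(myarray, xaxt = "n", xlab = "")          # draw without the default x axis
axis(1, at = 1:length(myarray), labels = my.vector.of.names)
```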
Re: [R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?
"First, I do not know how to specify such a link function in R. Second, if I can specify such a link, I could use (in place of d*j) the smooth baseline estimated after doing a Cox regression. But I don't know how to fit (for instance) a piecewise constant baseline hazard with a Poisson glm."

I found a possible answer to one of my own questions, sorry for the nuisance. Simply adding +factor(year) to the Poisson model will fit a piecewise constant baseline hazard (one step every year), or +cut(year, breaks = seq(1961, 1997, by = 2)) to have one step every two years. (Hope the rest is correct...)

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France

----- Original Message -----
From: Prof Brian Ripley [EMAIL PROTECTED]
To: Mayeul KAUFFMANN [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, August 13, 2004 8:40 AM
Subject: Re: [R] How to use the whole dataset (including between events) in Cox model (time-varying covariates)?

This is the consequence of the use of partial likelihood in the Cox model. You should read the literature on this point (for example, have you read Cox, 1972 and all its discussion, or Andersen, Borgan, Gill & Keiding?). It is not an R question. You need to make more assumptions, such as a smooth baseline hazard, and you can always use parametric models and a full likelihood (but you may have to program them yourself).

On Fri, 13 Aug 2004, Mayeul KAUFFMANN wrote:

Hello,

coxph does not use any information that is in the dataset between event times (or death times), since computation only occurs at event times.
For instance, removing observations when there is no event at that time in the whole dataset does not change the results:

set.seed(1)
data <- as.data.frame(cbind(start = c(1:5, 1:5, 1:4),
                            stop = c(2:6, 2:6, 2:5),
                            status = c(rep(0, 7), 1, rep(0, 5), 1),
                            id = c(rep(1, 5), rep(2, 5), rep(3, 4)),
                            x1 = rnorm(14)))
data
   start stop status id         x1
1      1    2      0  1 -0.6264538
2      2    3      0  1  0.1836433
3      3    4      0  1 -0.8356286
4      4    5      0  1  1.5952808
5      5    6      0  1  0.3295078
6      1    2      0  2 -0.8204684
7      2    3      0  2  0.4874291
8      3    4      1  2  0.7383247
9      4    5      0  2  0.5757814
10     5    6      0  2 -0.3053884
11     1    2      0  3  1.5117812
12     2    3      0  3  0.3898432
13     3    4      0  3 -0.6212406
14     4    5      1  3 -2.2146999

coxph(Surv(start, stop, status) ~ cluster(id) + x1, data = data, robust = T)
coxph(Surv(start, stop, status) ~ cluster(id) + x1,
      data = subset(data, stop %in% 4:5), robust = T)
# the same!!! (except n)

First, some data is lost. Second, this loss could be an important problem when there is a time-varying covariate that changes more quickly than the frequency of events. Specifically, I have a covariate which has low values most of the time. It sometimes jumps to high values, and that is hypothesized as greatly increasing the risk of an event. With rare events, the effect of this covariate will only be measured at event times. Chances are that the only time such a covariate is recorded at a high level, the individual for which it is measured as being high is having an event. This may bias the estimated coefficient.

Here is my question: how to fully use the dataset? (That is: how to have really _time-varying_ covariates (even if they change step by step, not continuously), not covariates whose changes are measured only at event times.) Ideally, the full dataset would be used to estimate the parameters, or at least to estimate the standard errors of the estimated parameters. Any ideas???

A second best (which might require less work) would be to use all the dataset to assess the predictive power of the model. Maybe by using "the expected number of events for an individual over the time interval that they were observed to be at risk":

predict(coxfit, type = "expected")

and comparing it with the observed number of events (does it use all data and take into account all the baseline hazard, even between events?). Or, if not, following Brian D. Ripley's suggestion about the baseline hazard: "As an approximation you can smooth the fitted baseline cumulative hazard (e.g. by package pspline) and ask for its derivative" (https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html), the following code could be used to estimate (and plot) a smooth baseline hazard:

t <- seq(min(data$start), max(data$stop), length = 100)
lines(t, predict(sm.spline(x = basehaz(coxfit)[, 2], y = basehaz(coxfit)[, 1],
                           norder = 2), t, 1))
# there is a problem with this code: one should add the constraint
# that the baseline hazard cannot be negative

The following computes the parametric part of the Cox model:

risk <- predict(coxfit, type = "risk")   # gives exp(X'b)

Something like baseline.hazard*risk would give the true risk at any time (but it would probably be much harder to compute), which could help assess the predictive power of the model. (Still a lot of work.)

Thanks in advance for any help or comment.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied
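The piecewise-constant baseline device mentioned earlier in this thread (+factor(year), or +cut(year, ...) for wider steps) can be sketched on invented count data; the yearly grid below is illustrative, matching the 1961-1997 span discussed.

```r
# Toy sketch of a piecewise-constant baseline hazard in a Poisson GLM
# (invented event counts and covariate).
set.seed(4)
d <- data.frame(year   = rep(1961:1996, each = 5),
                events = rpois(180, 0.3),
                x1     = rnorm(180))

# one hazard step per year:
fit1 <- glm(events ~ x1 + factor(year), family = poisson, data = d)
# one step every two years:
fit2 <- glm(events ~ x1 + cut(year, breaks = seq(1961, 1997, by = 2),
                              include.lowest = TRUE),
            family = poisson, data = d)
```

Note the include.lowest = TRUE: without it, cut() would code the first year as NA and silently change the number of rows in use.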
Re: [R] coxph question
Chris, I understood the following: you want to try every single covariate in a Cox model with only one covariate, then take the best ones according to p-value. Assume your columns look like:

stop status event x1 x2 x3 etc.

You want to add column 3 (x1), then 4, etc. I suggest a for() loop:

z <- NULL
for (i in 3:ncol(data)) {
  coxtmp <- coxph(Surv(stop, status) ~ data[, i])  # you can modify the formula,
                                                   # for instance for adding covariates in any case
  beta <- coxtmp$coefficients[1]
  se <- sqrt(diag(coxtmp$var))[1]
  z <- rbind(z, c(i, beta, se,
                  pvalue = signif(1 - pchisq((beta/se)^2, 1), 4)))
  print(i)
}
# then select the covariates according to the p-values:
colnames(z) <- c("i", "beta", "se", "pvalue")
z <- as.data.frame(z)
z[z$pvalue < .01, ]

Hope it helps.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France

----- Original Message -----

Thanks Mayeul, I actually would like to test each variable individually and use those that have a low p-value to build a classifier (not in a Cox model). Therefore, I need to write a function to subset those low p-value variables, instead of putting them in as covariates. Any ideas? Chris

-----Original Message-----

I have many variables to test using the Cox model (coxph), and I am only interested in those variables with p-value less than 0.01. Is there a quick way to do this automatically instead of looking at the output of each variable? Chris
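A self-contained variant of the loop above, using sapply() instead of growing z row by row; the toy data frame mimics the layout described (columns stop, status, then candidate covariates in columns 3 onwards), and assumes the survival package.

```r
# Univariate Cox screening with sapply() (toy data; assumes 'survival').
library(survival)

set.seed(5)
data <- data.frame(stop = rexp(50), status = rbinom(50, 1, 0.6),
                   x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

z <- t(sapply(3:ncol(data), function(i) {
  fit  <- coxph(Surv(data$stop, data$status) ~ data[, i])
  beta <- unname(coef(fit)[1])
  se   <- sqrt(diag(fit$var))[1]
  c(i = i, beta = beta, se = se,
    pvalue = signif(1 - pchisq((beta / se)^2, 1), 4))
}))
z <- as.data.frame(z)
names(data)[z$i[z$pvalue < .01]]     # the covariates that pass the screen
```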
[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?
Hello,

coxph does not use any information that is in the dataset between event times (or death times), since computation only occurs at event times. For instance, removing observations when there is no event at that time in the whole dataset does not change the results:

set.seed(1)
data <- as.data.frame(cbind(start = c(1:5, 1:5, 1:4),
                            stop = c(2:6, 2:6, 2:5),
                            status = c(rep(0, 7), 1, rep(0, 5), 1),
                            id = c(rep(1, 5), rep(2, 5), rep(3, 4)),
                            x1 = rnorm(14)))
data
   start stop status id         x1
1      1    2      0  1 -0.6264538
2      2    3      0  1  0.1836433
3      3    4      0  1 -0.8356286
4      4    5      0  1  1.5952808
5      5    6      0  1  0.3295078
6      1    2      0  2 -0.8204684
7      2    3      0  2  0.4874291
8      3    4      1  2  0.7383247
9      4    5      0  2  0.5757814
10     5    6      0  2 -0.3053884
11     1    2      0  3  1.5117812
12     2    3      0  3  0.3898432
13     3    4      0  3 -0.6212406
14     4    5      1  3 -2.2146999

coxph(Surv(start, stop, status) ~ cluster(id) + x1, data = data, robust = T)
coxph(Surv(start, stop, status) ~ cluster(id) + x1,
      data = subset(data, stop %in% 4:5), robust = T)
# the same!!! (except n)

First, some data is lost. Second, this loss could be an important problem when there is a time-varying covariate that changes more quickly than the frequency of events. Specifically, I have a covariate which has low values most of the time. It sometimes jumps to high values, and that is hypothesized as greatly increasing the risk of an event. With rare events, the effect of this covariate will only be measured at event times. Chances are that the only time such a covariate is recorded at a high level, the individual for which it is measured as being high is having an event. This may bias the estimated coefficient.

Here is my question: how to fully use the dataset? (That is: how to have really _time-varying_ covariates (even if they change step by step, not continuously), not covariates whose changes are measured only at event times.) Ideally, the full dataset would be used to estimate the parameters, or at least to estimate the standard errors of the estimated parameters. Any ideas???

A second best (which might require less work) would be to use all the dataset to assess the predictive power of the model. Maybe by using "the expected number of events for an individual over the time interval that they were observed to be at risk":

predict(coxfit, type = "expected")

and comparing it with the observed number of events (does it use all data and take into account all the baseline hazard, even between events?). Or, if not, following Brian D. Ripley's suggestion about the baseline hazard: "As an approximation you can smooth the fitted baseline cumulative hazard (e.g. by package pspline) and ask for its derivative" (https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html), the following code could be used to estimate (and plot) a smooth baseline hazard:

t <- seq(min(data$start), max(data$stop), length = 100)
lines(t, predict(sm.spline(x = basehaz(coxfit)[, 2], y = basehaz(coxfit)[, 1],
                           norder = 2), t, 1))
# there is a problem with this code: one should add the constraint
# that the baseline hazard cannot be negative

The following computes the parametric part of the Cox model:

risk <- predict(coxfit, type = "risk")   # gives exp(X'b)

Something like baseline.hazard*risk would give the true risk at any time (but it would probably be much harder to compute), which could help assess the predictive power of the model. (Still a lot of work.)

Thanks in advance for any help or comment.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France
[R] coxph question
I have many variables to test using the Cox model (coxph), and I am only interested in those variables with p-value less than 0.01. Is there a quick way to do this automatically instead of looking at the output of each variable? Chris

I guess you need covariate selection. For a lengthy discussion of another method (AIC/BIC), look at last month's archive: https://www.stat.math.ethz.ch/pipermail/r-help/2004-July/subject.html#53519 and try using:

library(MASS)
stepAIC(...)

Most of the time, it starts by removing the covariate with the highest p-value.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France
[R] proximity covariate to model dependence in cox model (question from Vito Muggeo)
(I received a message from Vito Muggeo [who sometimes posts here] about my previous posts here. Thus, it might be of interest here.)

"Dear Mayeul, I read your post on the R list about survival analysis with multiple (two, if I am not wrong) events per subject."

Sometimes subjects even have 3 events (i.e. civil wars over the last 40 years).

"1) How did you calculate the pxcw variable?" (Note: pxcw stands for ProXimity.of.last.Civil.War.)

The answer is at https://www.stat.math.ethz.ch/pipermail/r-help/2004-July/053556.html after the words "I used fit$loglik to chose a.chosen.parameter from 8 values, for 3 types of events". I used (exp(-days.since.event.of.type.one/a1)), which gives an indicator between 1 and 0 that falls quickly first, then more slowly (see the end of the above post, 053556.html). Simply only use event.of.type.one. I am not doing a competing risk analysis: I am only counting one type of event (type.one), but the covariates measure other events that may increase the risk.

The difficulty is to choose a good a1 parameter. You will typically need values much lower than mine with your example dataset. It then becomes:

la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)  # list of values to choose from; adapt it
z <- NULL
for (a1 in la) {
  coxtmp <- coxph(Surv(start, stop, status) ~
                    I(exp(-days.since.event.of.type.one/a1))
                  + other.time.dependent.covariates
                  + cluster(id),
                  data = x, robust = T)
  z <- rbind(z, c(a1, coxtmp$wald.test, coxtmp$rscore,
                  coxtmp$loglik, coxtmp$score))
}
z <- data.frame(z)
names(z) <- c("a1", "wald.test", "rscore", "NULLloglik", "loglik", "score")
z[which.max(z$rscore), ]
z[which.max(z$loglik), ]

"Namely, for instance, given the following dataset (standard in the multiple event formulation):

data.frame(id = c(3, 3, 4, 4, 5), start = c(0, 10, 0, 7, 0),
           stop = c(10, 15, 7, 20, 9),
           event = c(1, 1, 1, 0, 0), stratum = c(1, 2, 1, 2, 1))

  id start stop event stratum
1  3     0   10     1       1
2  3    10   15     1       2
3  4     0    7     1       1
4  4     7   20     0       2
5  5     0    9     0       1

how is the pxcw variable computed for each *row* in the dataset? It should be something like ifelse(stratum==1, 0, (stop-start)), i.e. (0, 5, 0, 13, 0), or am I wrong?"

It looks computationally OK, but strange to me. You can choose the function that matches the dependence best; maybe it is this one. But, assuming risk jumps just after an event and then decreases slowly, I would rather use a very high value when there is no previous event, for instance:

ifelse(stratum == 1, 100, (stop - start))

Otherwise, a very close event (yesterday) would nearly be coded like no previous event! Or, even better, truncate the decrease:

ifelse(stratum == 1, 100, ifelse(stop - start > 100, 100, stop - start))

Here, after 100 days, it is like having no previous event. But you should divide your observations to account for the change in the proximity variable. Instead of:

id start stop event stratum proxim
 4     0    7     1       1    100
 4     7   20     0       2     13

use:

id start stop event stratum proxim
 4     0    7     1       1    100
 4     7    8     0       2      1
 4     8    9     0       2      2
 .
 .
 4    19   20     0       2     13

With the function I used (a covariate days.since.event, which was coded 999 if there was no previous event), I need: exp((-days.since.event)/a1).

"Is this your idea or did you find it elsewhere? Could you give me any reference?"

The proximity covariate is from http://www.worldbank.org/research/conflict/papers/CivilPeace.pdf (But they do not use the counting process we use here. They only measure covariates for all countries when one country experiences an event. Thus, all the rest is mine, I think.)

"Hope you can help me, regards, vito"
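The proximity covariate discussed above can be computed in one line; the values below are invented, with 999 coding "no previous event" as in the post.

```r
# Toy computation of the proximity covariate exp(-days.since.event/a1).
a1 <- 263.5                                   # decay parameter, tuned via log-likelihood
days.since.event <- c(999, 1, 50, 200, 999)   # one value per observation interval
pxcw <- exp(-days.since.event / a1)           # near 1 just after an event, near 0 with none
```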
[R] covariate selection in cox model (counting process)
"If you can get the conditional independence (martingaleness) then, yes, BIC is fine. One way to check might be to see how similar the standard errors are with and without the cluster(id) term."

(Thank you again, Thomas!)

At first look, the values seemed very similar (see below, case 2). However, to check this without being too subjective, and without a specific test, I needed other values to assess the size of the differences: what is similar, what is not?

=============== CASE 1 ===============
I first estimated the model without modeling dependence:

Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) + pop +
    pib + pib2 + crois + instab.x1 + instab.autres, data = xstep)

                 coef exp(coef) se(coef) robust se     z       p
pop            0.3606     1.434   0.0978    0.1182  3.05 2.3e-03
pib           -0.5947     0.552   0.1952    0.1828 -3.25 1.1e-03
pib2          -0.4104     0.663   0.1452    0.1270 -3.23 1.2e-03
crois         -0.0592     0.943   0.0245    0.0240 -2.46 1.4e-02
instab.x1      2.2059     9.079   0.4692    0.4097  5.38 7.3e-08
instab.autres  0.9550     2.599   0.4700    0.4936  1.93 5.3e-02

Likelihood ratio test=74  on 6 df, p=6.2e-14  n= 7286

There seems to be a strong linear relationship between the standard errors (se, or naive se) and the robust se:

summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.96103    0.04064   23.65 2.52e-06 ***
Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9894

=============== CASE 2 ===============
Then I added a variable (pxcw) measuring the proximity of the previous event (0 < pxcw < 1):

n= 7286
                 coef exp(coef) se(coef) robust se     z       p
pxcw           0.9063     2.475   0.4267    0.4349  2.08 3.7e-02
pop            0.3001     1.350   0.1041    0.1295  2.32 2.0e-02
pib           -0.5485     0.578   0.2014    0.1799 -3.05 2.3e-03
pib2          -0.4033     0.668   0.1450    0.1152 -3.50 4.6e-04
crois         -0.0541     0.947   0.0236    0.0227 -2.38 1.7e-02
instab.x1      1.9649     7.134   0.4839    0.4753  4.13 3.6e-05
instab.autres  0.8498     2.339   0.4693    0.4594  1.85 6.4e-02

Likelihood ratio test=78.3  on 7 df, p=3.04e-14  n= 7286

                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.98397    0.02199   44.74 8.35e-09 ***
Multiple R-Squared: 0.997, Adjusted R-squared: 0.9965

The naive standard errors (se) seem closer to the robust se than they were when not modeling the dependence: 0.98397 is very close to one, R^2 grew, etc. The dependence is high (the risk is multiplied by 2.475 the day after an event), but conditional independence (given the covariates) seems hard to reject.

=============== CASE 3 ===============
Finally, I compared these results with those without repeated events (which gives a smaller dataset): a country is removed as soon as we observe its first event. (The robust se is still computed, even if the naive se should in fact be used here to compute the p-value.)

coxph(formula = Surv(start, stop, status) ~ cluster(ccode) + pop +
    pib + pib2 + crois + instab.x1 + instab.autres,
    data = xstep[no.previous.event, ])

                 coef exp(coef) se(coef) robust se     z       p
pop            0.4236     1.528   0.1030    0.1157  3.66 2.5e-04
pib           -0.7821     0.457   0.2072    0.1931 -4.05 5.1e-05
pib2          -0.3069     0.736   0.1477    0.1254 -2.45 1.4e-02
crois         -0.0432     0.958   0.0281    0.0258 -1.67 9.5e-02
instab.x1      1.9925     7.334   0.5321    0.3578  5.57 2.6e-08
instab.autres  1.3571     3.885   0.5428    0.5623  2.41 1.6e-02

Likelihood ratio test=66.7  on 6 df, p=1.99e-12
n=5971 (2466 observations deleted due to missing)

summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))

                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.86682    0.07826   11.08 0.000104 ***
Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608, Adjusted R-squared: 0.953

There seems to be no evidence that the robust se is more different from the se in case 2 than in case 3 (and case 1). It even seems closer. I conclude that conditional independence (martingaleness) cannot be rejected in CASE 2, when modeling the dependence between events with a covariate.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France

Then, there is still another option. In fact, I already modelled explicitly the influence of past events with a "proximity of last event" covariate, assuming the dependence on the last event decreases at a constant rate (for instance, the proximity covariate varies from 1 to 0.5 in the first 10 years after an event, then from 0.5 to 0.25 in the next ten years
[R] BIC vz SBIC vz SIC
"2) Question: always on BIC, from the stepAIC() function help page I found a k=log(n) argument to add. Since that produces an error, is there a way to find the n dynamically?

stepAIC(mydata.logistic, trace = F, k = log(nrow(mydata)))

-- Daniele Medri"

(It was 3 weeks ago, but I was just faced with the same question myself.)

Just a little warning: it is not really done dynamically. stepAIC takes the part of mydata that was used to fit mydata.logistic, but k=log(nrow(mydata)) will still use the entire set. Thus, if there were some missing data in mydata, I suggest you use something like:

k = log(sum(complete.cases(mydata)))

which will be smaller than k=log(nrow(mydata)). If some data are missing only for a covariate and this one is withdrawn, you will have a warning and it will stop:

Error in stepAIC(mydata.logistic (etc.): number of rows in use has changed: remove missing values?

since (I presume) these criteria need to compare two models on an identical dataset. In any other case, if you have NAs at the start but the number of NAs does not change in the process, you won't be warned if you use k=log(nrow(mydata)).

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France
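The advice above can be sketched on a toy logistic model (all names are illustrative): fix the row set to the complete cases first, then use its size as BIC's n, which also avoids the "number of rows in use has changed" error during stepwise search.

```r
# BIC-style stepwise selection with the correct n (toy data; assumes MASS).
library(MASS)

set.seed(6)
mydata <- data.frame(y  = rbinom(100, 1, 0.5),
                     x1 = rnorm(100),
                     x2 = rnorm(100))
mydata$x2[1:5] <- NA                          # some missing values

mydata.cc <- mydata[complete.cases(mydata), ] # fix the row set first
mydata.logistic <- glm(y ~ x1 + x2, family = binomial, data = mydata.cc)

n <- nrow(mydata.cc)                          # 95, not nrow(mydata) = 100
fit.bic <- stepAIC(mydata.logistic, trace = FALSE, k = log(n))
```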
[R] Reading and treating multiple files....
"I will get - in a few days - a set of separate files (one for each record; ~80'000 files per year) named key.txt, in which other info (measurements of pressure and velocity at each second) is stored; typically each one of these separate files will hold something between 200 and 1500 records. I'm supposed to do statistical analysis on this dataset, on the yearly information. Well, as you can suspect, my problem is thus: batch processing of these individual files: reading each file (automatically!)."

I suggest you put all the files in a single directory, then use list.files() or dir() to make a list of these files:

myfiles <- list.files("c:\\mydir")   # on Windows

Then you do a loop over these files with read.table(). You can create a column named file with the name of the file. All of this can be done with:

data <- NULL
for (i in dir("c:\\mydir"))
  data <- rbind(data, cbind(file = i,
                            read.table(paste("c:\\mydir\\", i, sep = ""))))

Hope that helps.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France
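An alternative sketch to the rbind-in-a-loop recipe above: build the list of pieces with lapply() and combine once with do.call(). Two toy files are written to a temporary directory so the example is self-contained; the directory and column names are invented.

```r
# Read every file in a directory into one data frame, tagged by file name.
dirname <- tempdir()
write.table(data.frame(p = 1:3, v = 4:6),  file.path(dirname, "a.txt"))
write.table(data.frame(p = 7:8, v = 9:10), file.path(dirname, "b.txt"))

files  <- list.files(dirname, pattern = "\\.txt$", full.names = TRUE)
pieces <- lapply(files, function(f) cbind(file = basename(f), read.table(f)))
data   <- do.call(rbind, pieces)    # one data frame: columns file, p, v
```

Combining once at the end avoids repeatedly copying the growing data frame, which matters with ~80'000 files.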
[R] covariate selection in cox model (counting process)
"No, I mean recurrent events. With counting process notation but no recurrent events the partial likelihood is still valid, and the approach of treating it as a real likelihood for AIC (and presumably BIC) makes sense. Roughly speaking, you can't tell there is dependence until you see multiple events."

Thanks a lot, I got it (well, I hope so)! I have read in several places that events in the Andersen-Gill model must be conditionally independent, which is sometimes more precisely written as "conditionally independent given the covariates", or even more precisely: "the Andersen-Gill (AG) model assumes that each [individual] has a multi-event counting process with independent increments. The observed increments must be conditionally independent given the history of all observable information up to the event times." (http://www.stat.umu.se/egna/danardono/licdd.pdf)

Then, there is still another option. In fact, I already modelled explicitly the influence of past events with a "proximity of last event" covariate, assuming the dependence on the last event decreases at a constant rate (for instance, the proximity covariate varies from 1 to 0.5 in the first 10 years after an event, then from 0.5 to 0.25 in the next ten years, etc.). With a well-chosen modelling of the dependence effect, the events become conditionally independent, I do not need a +cluster(id) term, and I can use fit$loglik to make a covariate selection based on BIC, right?

Thanks a lot again for your time.

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France

PS: I wrongly concluded from the R statement "(Note: the likelihood ratio and score tests assume independence of observations within a cluster, the Wald and robust score tests do not)" that it meant independence between two consecutive observations (without any event). It made sense to me because when only one covariate changes for a given individual, and with a small change, there is a new observation with a risk very similar to the risk for the previous observation. But there is still independence with respect to the question of recurrent events. Maybe the warning should be rewritten to say "assume *conditional* independence of *events* (given the covariates)".
[R] covariate selection in cox model (counting process)
From 1962 to 1997 (36 years), I have the following number of complete-cases equivalents:

sum((fit$y)[, 2] - (fit$y)[, 1]) / (365.25 * 36)
[1] 104.4316

This seems more reasonable and would account for NAs in different models. However, it might be too high, because some countries are not at risk during the whole period: some did not exist because they gained independence near the end of the period (e.g. ex-USSR countries in the early 1990s), or because they were experiencing an event (new wars in countries already experiencing a war are not taken into account).

I may take the *proportion* of available data to time at risk to adjust for this: a country at risk during 1 year and for which data are available for this entire year will increase n by 1, not by 1/36. If the data frame dataset contains all countries at risk (including some with NAs), assuming id == id[complete.cases(dataset)] (all countries have at least one complete observation), this will be computed by:

attach(dataset)
sum(tapply(stop[complete.cases(dataset)] - start[complete.cases(dataset)],
           CCODE, sum)
    / tapply(stop - start, CCODE, sum))

But this would be a rather empirical adjustment, maybe with no theoretical basis. And I don't think I can enter this as argument k to step().

Thank you for having read this. Hope I was not too long. And thank you a lot for any help, comments, etc. (Sorry for mistakes in English, as I am a non-native English speaker.)

Mayeul KAUFFMANN
Univ. Pierre Mendes France, Grenoble - France
[R] covariate selection in cox model (counting process)
Thank you a lot for your time and your answer, Thomas. Like all good answers, it raised new questions for me ;-)

> In the case of recurrent events coxph() is not using maximum likelihood or even maximum partial likelihood. It is maximising the quantity that (roughly speaking) would be the partial likelihood if the covariates explained all the cluster differences.

I could have non-repeating events by removing countries once they have experienced a war. But I'm not sure it will change the estimation procedure, since this will change the dataset only, not the formula:

coxph(Surv(start,stop,status) ~ x1+x2+...+cluster(id), robust=T)

I am not sure I understood you well: do you really mean recurrent events alone, or any counting process notation (including notation merely allowing for recurrent events)? I thought the counting process notation did not really differ from the Cox model in R, since Terry M. Therneau (A Package for Survival Analysis in S, April 22, 1996) concludes his mathematical section 3.3 Cox Model with: "The above notation is derived from the counting process representation [...] It allows very naturally for several extensions to the original Cox model formulation: multiple events per subject, discontinuous intervals of risk [...], left truncation." (I used it to introduce 1. time-dependent covariates, some covariates changing yearly, others irregularly, and 2. left truncation: not all countries existed at the beginning of the study.)

> In the case of recurrent events coxph() is not using maximum likelihood or even maximum partial likelihood.

Then, what does fit$loglik give in this case? Still a likelihood, or a valid criterion to maximise? If not, how do I get (manually) the criterion that was maximised? That's of interest for me since I created artificial covariates measuring the proximity since some events: exp(-days.since.event/a.chosen.parameter).
...and I used fit$loglik to choose a.chosen.parameter from 8 values, for 3 types of events:

la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)  # list of values to choose from
z <- NULL
for (a1 in la) for (a2 in la) for (a3 in la) {
  coxtmp <- coxph(Surv(start,stop,status) ~
      I(exp(-days.since.event.of.type.one/a1))
    + I(exp(-days.since.event.of.type.two/a2))
    + I(exp(-days.since.event.of.type.three/a3))
    + other.time.dependent.covariates
    + cluster(id), data=x, robust=T)
  z <- rbind(z, c(a1, a2, a3, coxtmp$wald.test, coxtmp$rscore,
                  coxtmp$loglik, coxtmp$score))
}
z <- data.frame(z)
names(z) <- c("a1","a2","a3","wald.test","rscore","NULLloglik","loglik","score")
z[which.max(z$rscore),]
z[which.max(z$loglik),]

The last two commands gave me almost always the same set for c(a1,a2,a3), but they sometimes differed significantly on some models. Which criterion (if any?!) should I use to select the best set c(a1,a2,a3)? (If you wish to see what the proximity variables look like, run the following code. The dashed lines show the half-life of the proximity variable, here = 6 months, which is determined by a.chosen.parameter, e.g. a1 = la[1]:

#start of code
curve(exp(-x/263.5), 0, 8*365.25,
      xlab="number of days since last political regime change (dsrc)",
      ylab="Proximity of political regime change = exp(-dsrc/263.5)", las=1)
axis(1, at=365.25/2, labels="(6 months)")
axis(2, at=seq(0,1,.1), las=1)
lines(c(365.25/2, 365.25/2, -110), c(-.05, 0.5, 0.5), lty="dashed")
#end of code)

Thanks a lot again. Mayeul KAUFFMANN Univ. Pierre Mendes France Grenoble - France
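[Editor's note: two quick checks related to the code above, sketched on the `lung` dataset shipped with the survival package as a stand-in; this is not the author's data or code.]

```r
library(survival)  # survival is a recommended package shipped with R

# (1) fit$loglik has two elements -- the null-model and the fitted-model
#     (pseudo-)partial log-likelihoods -- which is why the rbind() above
#     needs both a "NULLloglik" and a "loglik" column name.
fit <- coxph(Surv(time, status) ~ age, data = lung)
length(fit$loglik)        # 2
2 * diff(fit$loglik)      # the likelihood-ratio chi-square statistic

# (2) the decay parameter a and the half-life h of exp(-t/a) are linked
#     by h = a * log(2); for a = la[1] = 263.5 days:
263.5 * log(2) / 365.25   # about 0.5 years, i.e. the 6-month half-life
```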
[R] smooth non cumulative baseline hazard in Cox model
Hi everyone. There have been several threads on the baseline hazard in the Cox model, but I think they were all about the cumulative baseline hazard, for instance http://tolstoy.newcastle.edu.au/R/help/01a/0464.html http://tolstoy.newcastle.edu.au/R/help/01a/0436.html

basehaz in package survival seems to compute a cumulative hazard. Extract from the basehaz function:

sfit <- survfit(fit)
H <- -log(sfit$surv)

Since sfit$surv is monotonic, H will be monotonic too, which makes me think it is a cumulative function. I think H(t) is the sum (integration) of the lambdas from 0 to t (am I right?). What I need might be lambda(t) or lambda(t)dt (I do not know for sure), something involving the instantaneous baseline risk. But, for sure, I've seen elsewhere what I need. Specifically, here: http://econ.worldbank.org/files/13214_CivilPeace.pdf that is, page 41 of HEGRE, Håvard; ELLINGSEN, Tanja; GATES, Scott; GLEDITSCH, Nils Petter (2001). Toward a Democratic Civil Peace? Democracy, Political Change, and Civil War, 1816-1992. American Political Science Review, vol. 95, no. 1. I'm doing the same job as Hegre et al. (studying civil wars) but with the counting process formulation of the Cox model (I use intervals; my formula looks like Surv(start,stop,status) ~ etc.). Like Hegre et al. (who use the Stata software), I would like to have a curve showing the (instantaneous) overall risk of civil war at a given time, taking away the effect of the covariates. For those who also use the SAS software (which I do not, unfortunately), the job I need done seems to be performed by the SMOOTH macro described in Survival Analysis Using the SAS System: A Practical Guide by Paul D. Allison (see below). There is a graph (output 5.22, smoothed hazard functions for two financial aid groups) on p. 170 of his book which shows another example (except I only need it for 1 group, at mean values). I hope I am clear enough. Thank you a lot for any help.
(sorry for mistakes in English, as I'm a non-native English speaker) Mayeul KAUFFMANN Univ. Pierre Mendes France Grenoble - France

SMOOTH Macro (SAS)

%macro smooth (data=_last_, time=, width=, survival=survival);
/* MACRO SMOOTH produces graphs of smoothed hazard functions using output
from either PROC LIFETEST or PROC PHREG. With PROC LIFETEST, it uses the
data set produced by the OUTSURV option in the PROC statement. With PROC
PHREG, it uses the data set produced by the BASELINE statement. SMOOTH
employs a kernel smoothing method described by H. Ramlau-Hansen (1983),
"Smoothing Counting Process Intensities by Means of Kernel Functions",
The Annals of Statistics 11, 453-466. If there is more than one survival
curve in the input data set, SMOOTH will produce multiple smoothed hazard
curves on the same axes. There are four parameters: DATA is the name of
the data set containing survivor function estimates; the default is the
most recently created data set. TIME is the name of the variable
containing event times. SURVIVAL is the name of the variable containing
survivor function estimates (the default is SURVIVAL, which is the
automatic name in PROC LIFETEST). WIDTH is the bandwidth of the smoothing
function; the default is 1/5 of the range of event times.
Example of usage: %smooth(data=my.data,time=duration,width=8,survival=s)
Author: Paul D. Allison, University of Pennsylvania [EMAIL PROTECTED] */
data _inset_;
set &data end=final;
retain _grp_ _censor_ 0;
t=&time;
survival=&survival;
if t=0 and survival=1 then _grp_=_grp_+1;
keep _grp_ t survival;
if final and _grp_ > 1 then call symput('nset','yes');
else if final then call symput('nset','no');
if _censor_ = 1 then delete;
if survival in (0,1) then delete;
run;
proc iml;
use _inset_;
read all var {t _grp_};
%if &width ne %then %let w2=&width;
%else %let w2=(max(t)-min(t))/5;
w=&w2;
z=char(w,8,2);
call symput('width',z);
numset=max(_grp_);
create _plt_ var{lambda s group};
setin _inset_;
do m=1 to numset;
  read all var {t survival _grp_} where (_grp_=m);
  n=nrow(survival);
  lo=t[1] + w;
  hi=t[n] - w;
  npt=50;
  inc=(hi-lo)/npt;
  s=lo+(1:npt)`*inc;
  group=j(npt,1,m);
  slag=1//survival[1:n-1];
  h=1-survival/slag;
  x=(j(npt,1,1)*t` - s*j(1,n,1))/w;
  k=.75*(1-x#x)#(abs(x)<=1);   /* Epanechnikov kernel */
  lambda=k*h/w;
  append;
end;
quit;
%if &nset = yes %then %let c==group;
%else %let c=;
proc gplot data=_plt_;
plot lambda*s &c / vaxis=axis1 vzero haxis=axis2;
axis1 label=(angle=90 f=titalic 'Hazard Function') minor=none;
axis2 label=(f=titalic "Time (bandwidth=&width)") minor=none;
symbol1 i=join color=black line=1;
symbol2 i=join color=red line=2;
symbol3 i=join color=green line=3;
symbol4 i=join color=blue line=4;
run;
quit;
%mend smooth;
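[Editor's note: a minimal R analogue of the SMOOTH macro's kernel method (Ramlau-Hansen 1983), not the author's code: it smooths the increments of the cumulative baseline hazard from basehaz() with the same Epanechnikov kernel. The `lung` dataset from the survival package is just a stand-in.]

```r
library(survival)
fit <- coxph(Surv(time, status) ~ age, data = lung)
bh  <- basehaz(fit)                        # cumulative baseline hazard H(t)
dH  <- diff(c(0, bh$hazard))               # jumps of H at the event times
w   <- diff(range(bh$time)) / 5            # bandwidth: 1/5 of range, as in the macro
s   <- seq(min(bh$time) + w, max(bh$time) - w, length.out = 50)
lambda <- sapply(s, function(si) {
  x <- (bh$time - si) / w
  # Epanechnikov kernel average of the hazard increments
  sum(0.75 * (1 - x^2) * (abs(x) <= 1) * dH) / w
})
plot(s, lambda, type = "l",
     xlab = "Time", ylab = "Smoothed baseline hazard")
```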
Re: [R] smooth non cumulative baseline hazard in Cox model
Thank you all for your quick answers. With respect to my question on a smooth non-cumulative baseline Cox hazard, I followed Prof Brian Ripley's advice and used the following:

library(survival)
plot(basehaz(coxfinal2)[,2]/365.25+1945, basehaz(coxfinal2)[,1], t="l")
xx <- seq(min(basehaz(coxfinal2)[,2]/365.25+1945),
          max(basehaz(coxfinal2)[,2]/365.25+1945), length=100)
#my start value was 1st january 1945
library(pspline)
lines(xx, predict(sm.spline(x=basehaz(coxfinal2)[,2]/365.25+1945,
                            y=basehaz(coxfinal2)[,1], norder=2),
                  xarg=xx, nderiv=1))

It might seem that computing the derivative when time is expressed in years gives the annual probability of event. The previous commands give a graphic exactly identical to:

plot(basehaz(coxfinal2)[,2], basehaz(coxfinal2)[,1], t="l")
xx <- seq(min(basehaz(coxfinal2)[,2]), max(basehaz(coxfinal2)[,2]), length=100)
lines(xx, 365.25*predict(sm.spline(x=basehaz(coxfinal2)[,2],
                                   y=basehaz(coxfinal2)[,1], norder=2),
                         xarg=xx, nderiv=1))  # [second command]

However, if p is the probability of event for the 1st day of a given year, it is not obvious to me that the probability that there is one event in the 1st year equals 365*p. Am I mistaken? If not, what does the second command compute? So it would help if someone could say what the time unit is for the risk shown by

lines(xx, predict(sm.spline(x=basehaz(coxfinal2)[,2]/365.25+1945,
                            y=basehaz(coxfinal2)[,1], norder=2),
                  xarg=xx, nderiv=1))

...

@@@ With respect to censoring, I think we all agree. Peter Dalgaard wrote:

> Prof Brian Ripley [EMAIL PROTECTED] writes:
> > > I'm doing the same job as Hegre et al. (studying civil wars) but with the counting process formulation of the Cox model. (I use intervals; my formula looks like Surv(start,stop,status) ~ etc.)
> > Careful, that is left- and right-censored, not intervals. Surv has a type= argument.
> Nitpick: That's left-*truncated* and right-censored (the status refers to the condition at the right end; people who die before the start are not registered at all).
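[Editor's note on the units question above: multiplying a daily hazard by 365.25 is only the small-hazard approximation to a yearly probability, since for a constant daily event probability p the chance of at least one event in a year is 1 - (1 - p)^365.25, which is close to 365.25*p only when p is small. A quick check with a hypothetical p:]

```r
p <- 1e-4                       # hypothetical daily event probability
exact  <- 1 - (1 - p)^365.25    # probability of at least one event in a year
approx <- 365.25 * p            # the linear ("multiply by 365") approximation
c(exact, approx)                # close for small p; the gap grows with p
```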
I use the following dataset:

id  start  stop  status  ...covariates...
1       1   365       0
1     365   400       1   [the war starts at 400 and ends at 550]
1     550   730       0   [there are possibly repeated events, so the country re-enters the study]
2       1   365       0
2     365   730       0
etc...

where there is one id for every country, that is, several lines for each country (each line thus representing an interval of time), with

coxph(Surv(start, stop, status) ~ x1+...+cluster(id))

I did not mean interval censoring (although I think it is present here for country 1 from time 400 to 550); I meant interval in the same meaning as in the R help for Surv:

time2: ending time of the interval for interval censored or counting process data only. Intervals are assumed to be open on the left and closed on the right, (start, end]. For counting process data, event indicates whether an event occurred at the end of the interval.

> Surv has a type= argument.

Yes, and the help says: The default is "right" or "counting" depending on whether the time2 argument is absent or present, respectively. Here, I omitted the type, which means I used a counting process. Thus, the union of all intervals for country 2 (here, lines 4 and 5) leads to one big interval which is left-truncated and right-censored. Anyway, I think there is no ambiguity, since if one tries type = "interval" it says:

Error in coxph(Surv(start, stop, status, type = "interval") ~ ...
Cox model doesn't support "interval" survival data

But thanks to Prof. Ripley for the comment, as I am not fully aware of the exact terminology in English. Regards, Mayeul KAUFFMANN
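[Editor's note: a self-contained sketch of the layout above with a hypothetical covariate x1 and toy values; the event indicators are made up (one event per country) so the toy fit is well-determined, and Surv() with two times and no type= defaults to the counting-process form.]

```r
library(survival)
d <- data.frame(
  id     = c(1,   1,   1,   2,   2),
  start  = c(1, 365, 550,   1, 365),
  stop   = c(365, 400, 730, 365, 730),
  status = c(0,   1,   0,   0,   1),   # toy events, not the author's data
  x1     = c(0.2, 0.5, 0.6, 0.1, 0.4)  # hypothetical covariate
)
S <- Surv(d$start, d$stop, d$status)
attr(S, "type")    # "counting": the default when time2 is present
fit <- coxph(Surv(start, stop, status) ~ x1 + cluster(id),
             data = d, robust = TRUE)
```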