Re: [R] How to simulate informative censoring in a Cox PH model?
Hi Greg

The copula concept seems a nicely simple way of simulating event times that are subject to informative censoring (in contrast to the double-Cox-model approach I use). The correlation between the marginal uniform random variables you speak of reminded me that my approach should also induce this correlation, just in a different way. Similarly, I should observe zero correlation between the event times from my outcome model and the independent censoring times. Unfortunately this was not the case - to cut a long story short, I was inadvertently generating my "independent" censoring times from a model that depended on covariates in the outcome model. This now explains the mixed results I rather laboriously attempted to describe previously.

Re-running some scenarios with my new error-free code, I can now clearly observe the point you have been making: informative censoring only leads to bias if the covariates in the censoring model are not also in the outcome model. Indeed, I can choose the common (to both models) treatment effect to be vastly different (with all other effects the same) and see no bias, yet small differences in the effect of the censoring-model covariate Z (not in the outcome model) lead to moderate biases.

I am still somewhat confused by the other approach to this problem that I have seen in various journal articles, where the authors assume an outcome model for the censored subjects - i.e. an outcome model for the unobserved event times. Under this approach the definition of informative censoring appears to be that the observed and unobserved outcome models differ. This approach also makes sense to me - under it, non-informative censoring merely loses precision in the parameter estimators through the reduced number of events, but does not lead to bias. However, the concept of correlated event and censoring times does not even present itself here?
Thanks

Dan

On Fri, Jul 31, 2015 at 5:06 PM, Greg Snow 538...@gmail.com wrote: [...]
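Dan's sanity check above - that event times from the outcome model should be uncorrelated with genuinely independent censoring times - can be sketched as follows. This is a Python/NumPy sketch rather than R, and the covariate and parameter choices are illustrative assumptions, not the exact values from the thread; the idea is to map each simulated time back to a uniform via its own true conditional CDF (probability integral transform) and confirm near-zero correlation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical covariates: a treatment indicator and a prognosis score Z
treat = rng.binomial(1, 0.5, n)
z = rng.beta(2, 2, n)

# Outcome model: Weibull with hazard h(t|x) = lam*exp(eta)*v*t^(v-1),
# so S(t|x) = exp(-lam*exp(eta)*t^v) and inverse-transform sampling applies
v_t, lam_t = 1.0, 0.5
eta_t = 1.0 - 1.0 * treat + 1.0 * z - 2.0 * treat * z
t = (-np.log(rng.uniform(size=n)) / (lam_t * np.exp(eta_t))) ** (1.0 / v_t)

# Independent censoring model: intercept only, no shared covariates
v_d, lam_d = 1.0, 0.1
eta_d = -0.7
d = (-np.log(rng.uniform(size=n)) / (lam_d * np.exp(eta_d))) ** (1.0 / v_d)

# Probability integral transform: U = F(T|x) is Uniform(0,1), and the two
# uniforms should be uncorrelated when D really is independent of T
u_t = 1 - np.exp(-lam_t * np.exp(eta_t) * t ** v_t)
u_d = 1 - np.exp(-lam_d * np.exp(eta_d) * d ** v_d)
print(round(float(np.corrcoef(u_t, u_d)[0, 1]), 3))  # close to 0
```

If the "independent" censoring model accidentally shares covariates with the outcome model (Dan's bug), this correlation comes out visibly non-zero.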
Re: [R] How to simulate informative censoring in a Cox PH model?
Daniel,

Basically I am just responding to your last paragraph (the others are interesting, but I think that you are learning as much as anyone and I don't currently have any other suggestions).

I am not an expert on copulas, so this is a basic understanding; you should learn more about them if you choose to use them. The main idea of a copula is that it is a bivariate or multivariate distribution in which every variable has a uniform marginal distribution but the variables are not independent of each other. How I would suggest using them: choose a copula, generate random points from a bivariate copula, then put those (uniform) values into the inverse CDF (quantile function) of the Weibull (or other distribution), one value giving the event time and the other the censoring time. This will give you times that marginally come from the distributions of interest but are not independent (so the censoring would be considered informative). Repeat this with different levels of dependence in the copula to see how much difference it makes in your simulations.

On Thu, Jul 30, 2015 at 2:02 PM, Daniel Meddings dpmeddi...@gmail.com wrote: [...]
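Greg's copula recipe above can be sketched as follows - a Python/NumPy sketch of a Gaussian copula (correlated bivariate normal pushed through the normal CDF gives correlated uniforms), with made-up Weibull shapes and scales; in R the same thing is available via the copula package.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
rho = 0.6  # copula dependence; rho = 0 recovers independent censoring

# 1. Bivariate normal with correlation rho, mapped through the standard
#    normal CDF -> a pair of correlated Uniform(0,1) variables (Gaussian copula)
zz = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
erf = np.vectorize(math.erf)
u = 0.5 * (1.0 + erf(zz / math.sqrt(2.0)))  # Phi(z), applied elementwise

# 2. Push each uniform through a Weibull quantile function (inverse CDF):
#    F^{-1}(u) = scale * (-log(1 - u))**(1/shape)
def weibull_quantile(u, shape, scale):
    return scale * (-np.log1p(-u)) ** (1.0 / shape)

event_time = weibull_quantile(u[:, 0], shape=1.5, scale=10.0)   # assumed marginal
censor_time = weibull_quantile(u[:, 1], shape=1.0, scale=12.0)  # assumed marginal

observed = np.minimum(event_time, censor_time)
event = event_time <= censor_time
print(round(float(np.corrcoef(event_time, censor_time)[0, 1]), 2))  # positive for rho > 0
```

The marginals are exactly the chosen Weibulls, yet the two times are dependent - informative censoring without any covariate that could have been put in the model.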
Re: [R] How to simulate informative censoring in a Cox PH model?
Thanks Greg once more for taking the time to reply. I certainly agree that this is not a simple set-up, although I think it is realistic. In short, you are correct that model mis-specification is the key to producing more biased estimates under informative than under non-informative censoring.

After looking again at my code and trying various things, I realize that the key factors that lead the informative and non-informative censoring data to give rise to the same biased estimates are how I generate my Z_i variable, and the magnitude of the Z_i coefficient in both the event and informative-censoring models. In the example I gave, I generated Z_i (I think of this as a poor-prognosis variable) from a beta distribution so that it ranged from 0 to 1. The biased estimates for beta_t_1 (I think of this as the effect of a treatment on survival) were approximately -1.56 when the true value was -1. What I forgot to mention is that fitting a Cox model with 1,000,000 subjects to the full data (i.e. no censoring at all) arguably gives the best treatment-effect estimate possible, given that the effects of Z_i and Z_i*Treat_i are not in the model. This best possible estimate turns out to be -1.55 - i.e. the example I gave just happens to be such that even with 25-27% censoring, the estimates obtained are almost the best that can be attained. My guess is that the informative censoring does not bias the estimate more than non-informative censoring because the only variable not accounted for in the model is Z_i, which does not have a large enough effect (beta_t_2 and/or beta_c_2), or perhaps because Z_i only has a narrow range, which does not permit the current beta_t_2 value to do any damage.

To investigate the beta_t_2/beta_c_2 issue I changed beta_c_2 from 2 to 7, beta_c_0 from 0.2 to -1.2, and beta_d_0 from -0.7 to -0.4 to keep the censoring percentages equal at about 30%. This time the best possible estimate of beta_t_1 was -1.53, similar to that obtained previously. The informative censoring gave an estimate for beta_t_1 of -1.49 whereas the non-informative censoring gave -1.53 - this time the non-informative censoring attains the best possible but the informative censoring does not. I then instead changed beta_t_2 from 1 to 7, beta_c_0 from 0.2 to 2, and beta_d_0 from -0.7 to -1.9, again to keep the censoring percentages equal at about 30%. This time the best possible estimate of beta_t_1 was -0.999, which is pretty much equal to the true value of -1. The informative censoring gave an estimate for beta_t_1 of -1.09 whereas the non-informative censoring gave -0.87 - surprisingly, this time the informative censoring is slightly closer to the "best possible" than the non-informative censoring.

To investigate the Z_i issue I generated it from a normal distribution with mean 1 and variance 1. I changed beta_c_0 from 0.2 to -0.5 to keep the censoring levels equal (this time about 30% for both). This time the best possible estimate was -1.98, which is further from -1 than in the previous examples. The informative censoring gave an estimate for beta_t_1 of -1.81 whereas the non-informative censoring gave -1.84. So again the informative censoring gives an estimate closer to the best possible than the non-informative censoring, but this time it does not attain the best possible.

In conclusion, it is clear to me that a stronger Z_i effect in the censoring model makes the informative censoring worse than the non-informative one (as expected), but a stronger Z_i effect in the event model does not have this effect and even makes the independent censoring worse - this may not hold in general, but I nonetheless see it here. I am wondering if this is because altering effects in the event model also affects the independent censoring process and so "muddies the waters", whereas altering effects in the informative-censoring model obviously confines the changes to just the informative censoring process. For a fixed treatment-effect size in both the event and informative-censoring models, giving Z_i a wider range than is possible under the beta distribution also appears to produce informative censoring that is worse than the non-informative one. This makes sense, I think, because the Z_i-response relationship must be more informative.

Thanks for your suggestion of copulas - I have not come across these. Is this similar to assuming an event model for the censored subjects (which is unobserved) - i.e. if the event model is different conditional on censoring, then if we could observe the events beyond censoring, the parameter estimates would clearly differ from those obtained when modelling only the non-censored times?

On Wed, Jul 29, 2015 at 5:37 PM, Greg Snow 538...@gmail.com wrote: [...]
Re: [R] How to simulate informative censoring in a Cox PH model?
As models become more complex it becomes harder to distinguish different parts and their effects. Even for a straightforward linear regression model, if X1 and X2 are correlated with each other then it becomes difficult to distinguish between the effects of X1^2, X2^2, and X1*X2. In your case the informative censoring and the model misspecification are becoming hard to distinguish (and it could be argued that having informative censoring is really just a form of model misspecification). So I don't think so much that you are doing things wrong, just that you are learning that the models are complex.

Another approach to simulation that you could try is to simulate the event time and censoring time using copulas (so that they can be correlated, giving informative censoring, but without relying on a term that you could have included in the model), then consider the event censored if the censoring time is shorter.

On Fri, Jul 24, 2015 at 10:14 AM, Daniel Meddings dpmeddi...@gmail.com wrote: [...]
Re: [R] How to simulate informative censoring in a Cox PH model?
Hi Greg

Many thanks for taking the time to respond to my query. You are right to point out the distinction between which variables are and are not included in the event-time process and in the censoring process - I clearly forgot this important aspect. I amended my code to do as you advise, and now I am indeed getting biased estimates when using the informatively censored responses. The problem now is that the estimates from the independently censored responses are the same - i.e. they are just as biased. Thus the bias seems to be due entirely to model mis-specification and not to the informative censoring.

To give a concrete example, I simulate event times T_i, informative censoring times C_i, and independent censoring times D_i from the following models:

T_i ~ Weibull(lambda_t(x), v_t), lambda_t(x) = lambda_t*exp( beta_t_0 + (beta_t_1*Treat_i) + (beta_t_2*Z_i) + (beta_t_3*Treat_i*Z_i) )
C_i ~ Weibull(lambda_c(x), v_c), lambda_c(x) = lambda_c*exp( beta_c_0 + (beta_c_1*Treat_i) + (beta_c_2*Z_i) + (beta_c_3*Treat_i*Z_i) )
D_i ~ Weibull(lambda_d(x), v_d), lambda_d(x) = lambda_d*exp( beta_d_0 )

where:

beta_t_0 = 1, beta_t_1 = -1, beta_t_2 = 1, beta_t_3 = -2, lambda_t = 0.5
beta_c_0 = 0.2, beta_c_1 = -2, beta_c_2 = 2, beta_c_3 = -2, lambda_c = 0.5
beta_d_0 = -0.7, lambda_d = 0.1

When I fit the Cox model to both the informatively censored responses and the independently censored responses, I include only the Treatment covariate in the model. I simulate Treat_i from a Bernoulli distribution with p = 0.5, and Z_i from a beta distribution so that Z ranges from 0 to 1 (I like to think of Z as a poor-prognosis probability, where Z_i = 1 means a subject is 100% certain to have a poor prognosis and Z_i = 0 means zero chance). These parameter choices give approximately 27% and 25% censoring for the informatively censored responses (using C_i) and the independently censored responses (using D_i) respectively. I use N = 2000 subjects and 2000 simulation replications.

From the above simulation I get estimates of beta_t_1 of -1.526 and -1.537 for the informatively censored and the independently censored responses respectively. Furthermore, when I fit a Cox model to the full responses (no censoring at all) I get an estimate of beta_t_1 of -1.542. This represents the best that can possibly be done given that Z and Treat*Z are not in the model. Clearly censoring is not making much of a difference here - model mis-specification dominates. I still must be doing something wrong, but I cannot figure this one out.

Thanks

Dan

On Thu, Jul 23, 2015 at 12:33 AM, Greg Snow 538...@gmail.com wrote: [...]
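Dan's three data-generating models can be sketched with inverse-transform sampling, as below. This is a Python/NumPy sketch (the logic ports directly to R); the betas and lambdas are the values from the post, but the Weibull shape parameters and the Beta parameters for Z_i are not stated in the thread, so the choices below (shape 1, Beta(2,2)) are assumptions, and the censoring percentages printed will not exactly match Dan's 25-27%.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

treat = rng.binomial(1, 0.5, n)   # Treat_i ~ Bernoulli(0.5)
z = rng.beta(2, 2, n)             # Z_i in (0,1); Beta(2,2) is an assumed choice

def sim_weibull(lin_pred, lam, shape, rng):
    """Inverse-transform draw from hazard h(t|x) = lam*exp(lin_pred)*shape*t^(shape-1),
    i.e. S(t|x) = exp(-lam*exp(lin_pred)*t^shape)."""
    u = rng.uniform(size=len(lin_pred))
    return (-np.log(u) / (lam * np.exp(lin_pred))) ** (1.0 / shape)

# Event, informative-censoring, and independent-censoring times
# (shape parameters v_t = v_c = v_d = 1 are assumptions)
t = sim_weibull(1.0 - 1.0*treat + 1.0*z - 2.0*treat*z, lam=0.5, shape=1.0, rng=rng)
c = sim_weibull(0.2 - 2.0*treat + 2.0*z - 2.0*treat*z, lam=0.5, shape=1.0, rng=rng)
d = sim_weibull(np.full(n, -0.7), lam=0.1, shape=1.0, rng=rng)

y_inf, delta_inf = np.minimum(t, c), (t <= c).astype(int)     # informative censoring
y_ninf, delta_ninf = np.minimum(t, d), (t <= d).astype(int)   # independent censoring

print(f"censored (informative):  {1 - delta_inf.mean():.0%}")
print(f"censored (independent):  {1 - delta_ninf.mean():.0%}")
```

Fitting a treatment-only Cox model to (y_inf, delta_inf) versus (y_ninf, delta_ninf) - e.g. with coxph(Surv(y, delta) ~ treat) in R's survival package - then reproduces the comparison discussed in the thread.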
Re: [R] How to simulate informative censoring in a Cox PH model?
I think that the Cox model still works well when the only information in the censoring is conditional on variables in the model. What you describe could be called non-informative conditional on x. To really see the difference you need informative censoring that depends on something not included in the model. One option would be to use copulas to generate dependent data and then transform the values using your Weibull. Or you could generate your event times and censoring times based on x1 and x2, but then only include x1 in the model.

On Wed, Jul 22, 2015 at 2:20 AM, Daniel Meddings dpmeddi...@gmail.com wrote:

I wish to simulate event times where the censoring is informative, and to compare the quality of parameter estimates from a Cox PH model with estimates obtained from event times generated with non-informative censoring. However I am struggling to do this, and I conclude that, rather than a technical flaw in my code, I instead do not understand what is meant by informative and non-informative censoring.

My approach is to simulate an event time T dependent on a vector of covariates x, with hazard function h(t|x) = lambda*exp(beta'*x)*v*t^(v-1). This corresponds to T ~ Weibull(lambda(x), v), where the scale parameter lambda(x) = lambda*exp(beta'*x) depends on x and the shape parameter v is fixed. I have N subjects, with T_i ~ Weibull(lambda(x_i), v_T), lambda(x_i) = lambda_T*exp(beta_T'*x_i), for i = 1,...,N. Here I assume the regression coefficients are p-dimensional.

I generate informative censoring times C_i ~ Weibull(lambda(x_i), v_C), lambda(x_i) = lambda_C*exp(beta_C'*x_i), and compute Y_inf_i = min(T_i, C_i) together with a censoring flag: delta_inf_i = 1 if T_i <= C_i (an observed event), and delta_inf_i = 0 if T_i > C_i (informatively censored: the event is not observed). I am convinced this is informative censoring because, as long as beta_T != 0 and beta_C != 0, for each subject the data-generating processes for T and C both depend on x.

In contrast, I generate non-informative censoring times D_i ~ Weibull(lambda_D*exp(beta_D), v_D), and compute Y_ninf_i = min(T_i, D_i) together with a censoring flag: delta_ninf_i = 1 if T_i <= D_i (an observed event), and delta_ninf_i = 0 if T_i > D_i (non-informatively censored: the event is not observed). Here beta_D is a scalar. I scale the simulation by choosing the lambda_T, lambda_C and lambda_D parameters such that the average proportions of subjects with T_i > C_i and with T_i > D_i both equal X%, i.e. X% of subjects are censored for both Y_inf_i and Y_ninf_i.

The problem is that even for, say, 30% censoring (which I think is high), the Cox PH parameter estimates using both Y_inf and Y_ninf are unbiased, when I expected the estimates using Y_inf to be biased - and I think I see why: however different beta_C is from beta_T, a censored subject can presumably influence the estimation of beta_T only by affecting the set of subjects at risk at any time t, but this does not change the fact that every single Y_inf_i with delta_inf_i = 1 will have been generated using beta_T only. Thus I do not see how my simulation can possibly produce biased estimates for beta_T using Y_inf. But then what is informative censoring, if not based on this approach?

Any help would be greatly appreciated.

--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
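Greg's second suggestion - generate the event and censoring times from both x1 and x2 but include only x1 in the fitted model - can be sketched as follows. This is a Python/NumPy sketch with made-up coefficients; rather than fitting a Cox model, it just shows the mechanism: conditional on dropping x2, the censored subjects differ systematically on the omitted covariate, which is exactly what makes the censoring informative for a model in x1 alone.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

x1 = rng.binomial(1, 0.5, n)   # covariate kept in the model
x2 = rng.normal(0.0, 1.0, n)   # covariate omitted from the model

def weibull_inv(u, rate, shape=1.0):
    # Inverse of S(t) = exp(-rate * t^shape): inverse-transform sampling
    return (-np.log(u) / rate) ** (1.0 / shape)

# Both the event time T and the censoring time C depend on x1 AND x2;
# only x1 would be included when fitting the Cox model
t = weibull_inv(rng.uniform(size=n), rate=0.5 * np.exp(-1.0 * x1 + 1.0 * x2))
c = weibull_inv(rng.uniform(size=n), rate=0.3 * np.exp(-1.0 * x1 + 1.5 * x2))

y = np.minimum(t, c)
event = t <= c

# The censored subjects are systematically different on the omitted x2,
# so censoring carries information the x1-only model cannot absorb
print(round(float(x2[event].mean()), 2), round(float(x2[~event].mean()), 2))
```

With these coefficients the censored group has a visibly higher mean x2 than the event group; setting the x2 coefficients equal in both models and including x2 in the fit would make the censoring non-informative conditional on the covariates, as Greg describes.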