Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Dear all,

Your persistent talk about what the bootstrap is and is not suitable for finally made me verify the findings in the Pawinski et al. paper. Here is the procedure and the findings:

- First of all, I took the raw data (posted earlier on this list) and estimated the AUC values using the equation coefficients of their recommended model (#10). I was _unable to reproduce_ either the r^2 or the predictive performance values. My results are 0.74 and 44%, respectively, while the reported figures were 0.862 and 82% (41 profiles out of 50). My scatterplot also looks different from the Fig. 2 model 10 scatterplot. Weird...

- Then I fit the multiple linear model to the whole dataset (no bootstrap), using the time-points of model #10. I obtained an r^2 of 0.74 (agreement), a mean prediction error of 7.4% +/-28.3%, and a predictive performance of 44%. The reported mean prediction error (PE) was 7.6% +/-26.7%, and the reported predictive performance 56% (page 1502, second column, second sentence from the top)! I think the difference in PE may be attributed to numerical differences between SPSS and R, though I can't explain the difference in predictive performance.

- Finally, I used Gustaf's bootstrap code to fit the linear regression with model #10 time-points on the resampled dataset. The r^2 of the model with median coefficients was identical to that of the model fit to the entire data, and the predictive performance was better by only one profile: 46%.

As you see, these figures are very far from the numbers reported in the paper. I will be in discussion with the authors on how they obtained these numbers, but I am starting to doubt whether this paper is valid at all...

- Later I tested the procedure on my own dataset (paper to appear in August), and found that the MLR model fit on the entire data has r^2 and predictive performance identical to those of the median-coefficient model from the bootstrap.
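For anyone who wants to replay this kind of check, here is a minimal R sketch of the validation step. The data are simulated stand-ins (the real MPA profiles and the model #10 coefficients are not reproduced here); "predictive performance" is, as in the paper, the fraction of profiles predicted within +/-15% of the trapezoidal AUC.

```r
## Simulated stand-in for the check described above: fit a limited-sampling
## model by plain multiple linear regression and compute its predictive
## performance (fraction of profiles predicted within +/-15% of the "true"
## AUC). All numbers here are invented; only the mechanics match.
set.seed(1)
n  <- 50
c0 <- rlnorm(n, 1,   0.3)   # trough concentration
c1 <- rlnorm(n, 2,   0.3)   # 1-h concentration
c2 <- rlnorm(n, 1.5, 0.3)   # 2-h concentration
auc <- 5 * c0 + 2 * c1 + 3 * c2 + rnorm(n, 0, 2)  # "trapezoidal" AUC0-12h

fit  <- lm(auc ~ c0 + c1 + c2)        # one fit to the whole dataset
pred <- fitted(fit)

r2 <- summary(fit)$r.squared
pe <- 100 * (pred - auc) / auc        # prediction error, %
predictive_performance <- mean(abs(pe) <= 15)
```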
I must admit, guys, *that I was wrong and you were right: this bootstrap-like procedure does not improve predictions* - at least not to the extent reported in the Pawinski et al. paper. I blindly believed in this paper, and I am somewhat embarrassed that I didn't verify its findings, despite the fact that their dataset has been available to me from the beginning. Maybe it was too much trust in the printed word and in the authority of the PhD biostatistician who devised the procedure... Nevertheless, I am happy that at least this procedure is harmless, and that I can reproduce the figures reported in /my/ paper.

Best regards, and apologies for being such a hard student. I am being converted to orthodox statistics.

--
Michal J. Figurski
HUP, Pathology Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Gustaf Rydevik wrote: [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Frank and all,

The point you were looking for was on a page linked from the referenced page - I apologize for the confusion. Please take a look at the last two paragraphs here: http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm

Possibly it's my ignorance, maybe it's yours, but you actually missed the important point again. It is that you just don't estimate the mean, or CI, or variance on PK profile data! It is as if you were trying to estimate the mean, CI, and variance of a Toccata__Fugue_in_D_minor.wav file. What for? The point is in the music! Would the mean or CI or variance tell you anything about that? Besides, everybody knows the variance (or variability?) is there and can estimate it without spending time on calculations.

What I am trying to do is comparable to compressing a wave into an mp3 - to predict the wave using as few data points as possible. I have a bunch of similar waves, and I'm trying to find a common equation to predict them all. I am *not* looking for the variance of the mean!

I could be wrong (though it seems less and less likely), but you keep talking about the same irrelevant parameters (CI, variance) on and on. Well, yes - we are at a standstill, but not because of Davison and Hinkley's book. I could try reading it, though as I stated above, it is not even remotely related to what I am trying to do. I'll skip it then - life is too short.

Nevertheless, I thank you (all) for the relevant criticism of the procedure (in the points where it was relevant). I plan to use this methodology further, and it was good to find out that it withstood your criticism. I will look into the penalized methods, though.

--
Michal J. Figurski

Frank E Harrell Jr wrote: [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
On Thu, Jul 31, 2008 at 4:30 PM, Michal Figurski [EMAIL PROTECTED] wrote: [...]

I take it you mean the sentence:

"For example, in here, the statistical estimator is the sample mean. Using bootstrap sampling, you can do beyond your statistical estimators. You can now get even the distribution of your estimator and the statistics (such as confidence interval, variance) of your estimator."

Again you are misinterpreting the text. The phrase about doing "beyond your statistical estimators" is explained in the next sentence, where he says that the bootstrap gives you information about the mean *estimator* (and not more information about the population mean). And since you're not interested in this information, in your case bootstrap/resampling is not useful at all.

As another example of misinterpretation: in your email from a week ago, it sounds like you believe that the authors of the original paper are trying to improve on a fixed model.

Figurski: "Regarding the multiple stepwise regression - according to the cited SPSS manual, there are 5 options to select from. I don't think they used the 'stepwise selection' option, because their models were already pre-defined. Variables were pre-selected based on knowledge of the pharmacokinetics of this drug and other factors. I think this part I understand pretty well."

This paragraph is wrong. Sorry, no way around it. Quoting from the Pawinski et al. paper:

"*Twenty-six(!)* 1-, 2-, or 3-sample estimation models were fit (r^2 0.341–0.862) to a randomly selected subset of the profiles using linear regression and were used to estimate AUC0–12h for the profiles not included in the regression fit, comparing those estimates with the corresponding AUC0–12h values, calculated with the linear trapezoidal rule, including all 12 timed MPA concentrations. The 3-sample models were constrained to include no samples past 2 h." (emph. mine)

They clearly state that they are choosing among 26 different models by using their bootstrap-like procedure, not improving on a single, predefined model. This procedure is statistically sound (more or less, at least) and not controversial. However, (again) what you want to do is *not* what they did in their paper!
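To make the quoted procedure concrete, here is a hedged R sketch - data and candidate models are invented - of choosing among models by repeated random splits: each candidate is fit on a random training subset and scored on the held-out profiles, and the split is repeated many times.

```r
## Repeated random-split validation used to *choose among* candidate models,
## as described in the quoted passage. Data and the three candidate models
## are invented; "50 splits" mirrors the paper's repetition count.
set.seed(2)
n   <- 50
d   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + d$x2 + rnorm(n)

models <- list(m1 = y ~ x1, m2 = y ~ x1 + x2, m3 = y ~ x3)
score <- sapply(models, function(f) {
  mean(replicate(50, {
    train <- sample(n, 35)                     # random subset for fitting
    fit   <- lm(f, data = d[train, ])
    pred  <- predict(fit, newdata = d[-train, ])
    sqrt(mean((d$y[-train] - pred)^2))         # RMSE on held-out profiles
  }))
})
best <- names(which.min(score))                # the model one would report
```

Here `m2`, the correctly specified model, should come out with the lowest held-out RMSE; the point is that the resampling is doing model *selection*, not model *improvement*.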
Resampling cannot improve on the performance of a pre-specified model. This is intuitively obvious, and moreover it's mathematically provable! That's why we're so certain of our standpoint. If you really wish, I (or someone else) could write out a proof, but I'm unsure whether you would be able to follow it. In the end, it doesn't really matter. What you are doing amounts to doing a regression 50 times when once would suffice. No big harm done, just a bit of unnecessary work - and proof to a statistically competent reviewer that you don't really understand what you're doing. The better option would be either to study some more statistics yourself, or to find a statistician who can do your analysis for you, and trust him to do it right.

Anyhow, good luck with your research.

Best regards,
Gustaf

--
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
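The "regression 50 times when once would suffice" point is easy to verify numerically. A sketch with simulated data: for a fixed, pre-specified model, the "median coefficient" model from resampling essentially reproduces the single full-data fit (the number of resamples is an arbitrary choice).

```r
## Numerical check: for a pre-specified model, the "median coefficient"
## model from resampling coincides (up to Monte Carlo noise) with the
## single full-data fit. Simulated data; 200 resamples is arbitrary.
set.seed(3)
n   <- 50
d   <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

full <- coef(lm(y ~ x, data = d))     # one regression on all the data
boot <- replicate(200, {
  i <- sample(n, replace = TRUE)      # resample profiles with replacement
  coef(lm(y ~ x, data = d[i, ]))
})
med <- apply(boot, 1, median)         # median of the resampled coefficients

max(abs(med - full))                  # tiny: the two models coincide
```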
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Gustaf,

Summarizing the things I don't understand:
- Honestly, I was thinking I could use the bootstrap to obtain a better estimate of a mean - provided that I want one. So, I can't?
- If I can't obtain reliable estimates of CI and variance from a small dataset, but I can do it with the bootstrap - isn't that a virtual increase in the size of the dataset? OK, these are just words; I won't fight over that.
- I don't understand why a procedure works for 26 models and doesn't work for one... Intuitively this doesn't make sense...
- I don't understand why resampling *cannot* improve... while it does? I know the proof is going to be hard to follow, but let me try! (The proof of the opposite is in the paper.)
- I truly don't understand what I don't understand about what I am doing. This is getting too convoluted for me...

And a remark about what I don't agree with, Gustaf: the text quoted below from Pawinski et al. ("Twenty-six...") is missing an important piece of information - that they repeated that step 50 times, each time with a randomly selected subset. Excuse my ignorance again, but this looks like bootstrap (re-sampling), doesn't it? Although I won't argue over names.

I want to assure everyone here that I did *exactly* what they did. I work in the same lab this paper came from, and I simply had their SPSS procedure translated to SAS. Moreover, the translation was done with the help of a _trustworthy biostatistician_ - I was not good enough with SAS at the time to do it myself. The biostatistician wrote the randomization and regression subroutines. I later improved them using macros (less code) and added the validation part. It was then approved by that biostatistician.

OK, I did not do exactly the same, because I repeated the step 100 times, for 34 *pre-defined* models, and on a different dataset. But that's about all the difference. I hope this resolves everyone's dilemma about whether I did what is described in Pawinski's paper or not.
This discussion, though, started with my question on how to do it in R instead of SAS, and with logistic (not linear) regression. Thank you, Gustaf, for the code - this was the help I needed.

--
Michal J. Figurski

Gustaf Rydevik wrote: [...] Quoting from the Pawinski et al. paper:

"*Twenty-six(!)* 1-, 2-, or 3-sample estimation models were fit (r^2 0.341–0.862) to a randomly selected subset of the profiles using linear regression and were used to estimate AUC0–12h for the profiles not included in the regression fit, comparing those estimates with the corresponding AUC0–12h values, calculated with the linear trapezoidal rule, including all 12 timed MPA concentrations. The 3-sample models were constrained to include no samples past 2 h." (emph. mine)

They clearly state that they are choosing among 26 different models by using their bootstrap-like procedure, not improving on a single, predefined model. [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Tim,

If I understand correctly, you are saying that one can't improve on estimating a mean by bootstrapping and summarizing the means of many such steps. As far as I understand (again), you're saying that this way one can only add bias, without any improvement...

Well, this contradicts some guides to the bootstrap that I found on the web (I did my homework), for example this one: http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap Statistic Mean.htm

It is all confusing, guys... Someone once said that there are as many opinions on a topic as there are statisticians... Also, translating your statements into the example of the hammer and the rock: you are saying that one cannot use a hammer to break rocks because it was created to drive nails. With all respect, despite my limited knowledge, I do not agree.

The big point is that the mean, or standard error, or confidence intervals of the data itself are *meaningless* in a pharmacokinetic dataset. These data are time series of a highly variable quantity that is known to display a peak (or two, in the case of Pawinski's paper). It is as if you tried to calculate the mean of a chromatogram (an example for chemists, sorry).

Nevertheless, I thank all of you experts for your insight and advice. In the end, I learned a lot, though I keep my initial view. Summarizing your criticism of the procedure described in Pawinski's paper:
- Some of you say that this isn't bootstrap at all. In terms of terminology I totally submit to that, because I know too little. Would anyone suggest a name?
- Most of you say that this procedure is not the best one, that there are better ways. I will definitely do my homework on penalized regression, though none of you has actually discredited this methodology. Therefore, though possibly not optimal, it remains valid.
- The criticism of predictive performance is that one also has to take into account other important quantities, like bias, variance, etc. Fortunately I did that in my work, using RMSE and log residuals from the validation process. I simply observed that models with relatively small RMSE and log residuals (compared to other models) usually possess good predictive performance, and vice versa. Predictive performance also has a great advantage over RMSE or variance or anything else suggested here - it is easily understood by non-statisticians. I don't think it is /too simple/ in Einstein's terms; it's just simple.

Kind regards,

--
Michal J. Figurski

Tim Hesterberg wrote: [...]
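Tim's claim - that replacing the sample mean with an average of bootstrap means only adds Monte Carlo noise - can be checked by simulation. A sketch in R; the sample size, the number of bootstrap samples B, and the number of simulated datasets are all arbitrary choices.

```r
## Simulation check: averaging B bootstrap means instead of using the plain
## sample mean only adds Monte Carlo noise, so over many simulated datasets
## its mean squared error comes out (slightly) worse, never better.
set.seed(4)
nsim <- 2000; n <- 30; B <- 10

plain  <- numeric(nsim)
booted <- numeric(nsim)
for (s in 1:nsim) {
  x <- rnorm(n)                      # true population mean is 0
  plain[s]  <- mean(x)
  booted[s] <- mean(replicate(B, mean(sample(x, replace = TRUE))))
}
mse_plain  <- mean(plain^2)          # MSE of the plain sample mean
mse_booted <- mean(booted^2)         # MSE of the bootstrap-averaged mean
c(plain = mse_plain, booted = mse_booted)   # booted comes out larger
```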
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote:

"[...] Well, this is in contradiction to some guides to bootstrap that I found on the web (I did my homework), for example this one: http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap Statistic Mean.htm"

Where on that web site does it state anything that is remotely related to your point? It shows how to use the bootstrap to estimate the bias; it does not show that the bias is important (it isn't: the simulation is from a normal distribution, the sample mean is perfectly unbiased, and you are just seeing sampling error in the bias estimate).

"[...] Nevertheless, I thank all of you, experts, for your insight and advice. In the end, I learned a lot, though I keep my initial view. Summarizing your criticism of the procedure described in Pawinski's paper: [...]"

If you think that you can learn statistics easily, when I would have a devil of a time learning chemistry, and if you are not willing to read, for example, the Davison and Hinkley bootstrap text, I guess we are at a standstill.

Frank Harrell
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
I'll address the question of whether you can use the bootstrap to improve estimates, and whether you can use the bootstrap to virtually increase the size of the sample.

Short answer - no, with some exceptions (bumping / random forests).

Longer answer: suppose you have data (x1, ..., xn) and a statistic ThetaHat, that you take a number of bootstrap samples (all of size n), and that you let ThetaHatBar be the average of the bootstrap statistics from those samples. Is ThetaHatBar better than ThetaHat? Usually not. Usually it is worse. You have not collected any new data; you are just using the existing data in a different way, and that is usually harmful:
* If the statistic is the sample mean, all this does is add some noise to the estimate.
* If the statistic is nonlinear, this gives an estimate that has roughly double the bias, without improving the variance.

What are the exceptions? The prime example is tree models (random forests) - taking bootstrap averages helps smooth out the discontinuities in tree models. For a simple example, suppose that a simple linear regression model really holds, y = beta x + epsilon, but that you fit a tree model; the tree model predictions are a step function. If you bootstrap the data, the boundaries of the step function will differ from one sample to another, so the average over the bootstrap samples smears out the steps, getting closer to the smooth linear relationship.

Aside from such exceptions, the bootstrap is used for inference (bias, standard error, confidence intervals), not for improving on ThetaHat.

Tim Hesterberg

"Hi Doran, Maybe I am wrong, but I think the bootstrap is a general resampling method which can be used for different purposes... Usually it works well when you do not have a representative sample set (maybe with a limited number of samples). Therefore, I side with Michal... P.S. Overfitting, in my opinion, describes getting a model which is quite specific to the training dataset but cannot be generalized to new samples. Thanks, --Jerry"

2008/7/21 Doran, Harold [EMAIL PROTECTED]:

"I used bootstrap to virtually increase the size of my dataset; it should result in estimates closer to those from the population - isn't that the purpose of bootstrap?"

No, not really. The bootstrap is a resampling method for variance estimation. It is often used when there is not an easy way, or a closed-form expression, for estimating the sampling variance of a statistic.
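The canonical use Harold describes - estimating the sampling variability of a statistic with no convenient closed form - is a few lines in R. A sketch with a simulated skewed sample and the sample median as the statistic:

```r
## Bootstrap standard error and percentile interval for a statistic
## (the sample median) whose sampling variance has no simple closed form.
## The data are simulated; 2000 resamples is an arbitrary choice.
set.seed(5)
x <- rexp(40)                                    # skewed sample, n = 40

boot_med <- replicate(2000, median(sample(x, replace = TRUE)))
se_boot  <- sd(boot_med)                         # bootstrap standard error
ci_boot  <- quantile(boot_med, c(0.025, 0.975))  # percentile interval
```

(For real work, the `boot` package's `boot()` and `boot.ci()` wrap this pattern with better interval methods.)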
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Greg and all, Just another thought on bias and variability. As I tried to explain, I perceive this as a very practical problem. The equation that is the goal of this work is supposed to help clinicians estimate a pharmacokinetic parameter. It therefore must be simple, and also presented in simple language, so that an average spreadsheet user can make use of it. Therefore, in the end, isn't the *predictive performance* the ultimate measure of it all? Doesn't it account for bias and all the other stuff? It does tell you in how many cases you may expect the predicted value to be within 15% of the true value. I apologize for my naive questions again, but aren't the calculations of bias, variance, etc., then just a waste of time, when you have it all summarized in the predictive performance? -- Michal J. Figurski

Greg Snow wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski Sent: Wednesday, July 23, 2008 10:22 AM To: r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

Thank you all for your words of wisdom. I am starting to get what you mean by bootstrap. Not surprisingly, it seems to be something other than what I do. The bootstrap is a tool, and I would rather compare it to a hammer than to a gun. People say a hammer is for driving nails. This situation is as if I planned to use it to break rocks.

The bootstrap is more like a whole toolbox than just a single tool. I think part of the confusion in this discussion is that you kept asking for a hammer, and Frank and others kept looking at their toolbox full of hammers and asking you which one you wanted. Yes, you can break a rock with a hammer designed to drive nails, but why not use the hammer designed to break rocks when it is easily available?

The key point is that I don't really care about the bias or variance of the mean in the model. These things are useful for statisticians; regular people (like me, also a chemist) do not understand them and have no use for them (well, now I somewhat understand). My goal is very practical: I need an equation that can predict a patient's outcome, based on some data, with maximum reliability and accuracy.

But to get the model with maximum reliability and accuracy you need to account for bias and minimize variability. You may not care what those numbers are directly, but you do care indirectly about their influence on your final model. Another instance where both sides were talking past each other. -- Gregory (Greg) L. Snow, Ph.D. Statistical Data Center, Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111
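For what it's worth, the "predictive performance" measure discussed here - the fraction of predictions falling within 15% of the observed value - is easy to compute; here is a minimal sketch (the function name and the numbers are my own, not from any paper):

```r
## Proportion of predictions within `tol` (relative) of the observed value.
pred.performance <- function(pred, obs, tol = 0.15) {
  mean(abs(pred - obs) / abs(obs) <= tol)
}
pred.performance(c(95, 110, 140), c(100, 100, 100))  # 2 of 3 within 15%
```

Note this is a summary of prediction error on whatever data it is computed from; evaluated on the training data it will be optimistic, which is exactly where bias and variance re-enter the picture.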
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
To quote (or as nearly so as I can) Einstein's famous remark: "Make everything as simple as possible ... but no simpler." Moreover, "as possible" here means maintaining fidelity to scientific validity, not "simple enough for me to understand." So I don't think a physicist can explain relativistic cosmology to me (or an organic chemist, how to synthesize ketones) so that I can understand it without compromising scientific validity. The onus is then on me either to learn what I need to know to understand it, or to accept the authoritative view of the physicist (or chemist). I cannot claim ignorance and reject the cosmology because it is beyond me. That's the flat-earth philosophy of science, and it is a terrible obstacle to scientific progress and human enlightenment in general. Cheers, Bert Gunter
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
What are the arguments against the fidelity of this concept to scientific validity? The concept of predictive performance was devised by one of you biostatisticians - not me! I accept the authoritative view of the person who devised it, especially because I do understand it. When I think of it - excuse my ignorance - it looks to me that this measure summarizes the effects of bias, variance, etc., and all the analytical and other errors. Please correct me if I am wrong, but spare me your sarcasm. -- Michal J. Figurski
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
--- On Thu, 7/24/08, Michal Figurski [EMAIL PROTECTED] wrote: From: Michal Figurski Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them? Cc: r-help@r-project.org Received: Thursday, July 24, 2008, 11:02 AM [...] Therefore, in the end, isn't the *predictive performance* an ultimate measure of it all? Doesn't it account for bias and all the other stuff?

I think you need to look at Greg Snow's comment again. I am not a statistician, but Greg says: "But to get the model with maximum reliability and accuracy you need to account for bias and minimize variability." As I read it, your predictive validity is partly a function of how well you account for bias and minimize variability. Prediction may be the desired outcome, but you don't get the best possible outcome unless you manage to account for these issues.
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hi All, It really comes down to a question of attitude: you either want to learn something fundamental or core, and so bootstrap yourself to a better place (at least away from where you are), or you don't. As Marc said, Michal seems to have erected a wall around his thinking. I don't think it's fair to take pot shots at Frank for not wanting to promote or further something he doesn't believe in. He's a regular contributor to the list who gives sound advice. He's also one of the few experts on the list who is prepared to give statistical advice. Regards, Mark.

Rolf Turner wrote: On 23/07/2008, at 1:17 PM, Frank E Harrell Jr wrote: Michal Figurski wrote: Hmm... It sounds like ideology to me. I was asking for technical help. I know what I want to do, just don't know how to do it in R. I'll go back to SAS then. Thank you. -- Michal J. Figurski

You don't understand any of the theory and you are using techniques you don't understand and have provided no motivation for. And you are the one who is frustrated with others. Wow.

Come off it guys. It is indeed very frustrating when one asks ``How can I do X'' and gets told ``Don't do X, do Y.'' It may well be the case that doing X is wrong-headed, misleading, and may cause the bridge to fall down, or the world to come to an end. Fair enough to point this out --- but then why not just tell the poor beggar, who asked, how to do X? The only circumstance in which *not* telling the poor beggar how to do X is justified is that in which it takes considerable *work* to figure out how to do X. In this case it is perfectly reasonable to say ``I think doing X is stupid, so I am not going to waste my time figuring out for you how to do it.'' I don't know enough about the bootstrapping software (don't know *anything* about it, actually) to know whether the foregoing circumstance applies here. But I suspect it doesn't.
And I suspect that you (Frank) could tell Michal in a few lines the answer to the question that he *asked* (as opposed, possibly, to the question that he should have asked). If it were my problem I'd just write my own bootstrapping function to apply to the problem in hand. It can't be that hard ... just a for loop and a call to sample(..., replace=TRUE). If you can write macros in SAS then ... cheers, Rolf
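Rolf's recipe - a for loop and a call to sample(..., replace=TRUE) - might look like this for a logistic model (a minimal sketch of my own with made-up data, using plain glm rather than any particular package):

```r
## Hand-rolled bootstrap of logistic-regression coefficients.
set.seed(123)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50),
                y  = rbinom(50, 1, 0.5))        # made-up data
B <- 200
coefs <- matrix(NA, nrow = B, ncol = 3)
for (b in 1:B) {
  db <- d[sample(nrow(d), replace = TRUE), ]    # resample rows
  coefs[b, ] <- coef(glm(y ~ x1 + x2, data = db, family = binomial))
}
apply(coefs, 2, median)   # median bootstrap coefficients
apply(coefs, 2, sd)       # bootstrap standard errors
```

Whether one should then *use* the median coefficients as the model, rather than the fit to the full data, is exactly the point disputed in this thread; the standard errors are the uncontroversial output.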
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Thank you Gustaf, I apologize for not including example data in my first email. Nevertheless, your code worked for me excellently - I only added 55 as the size of the sample. I must admit this code looks so much simpler compared to SAS. I am beginning to love R, despite some disrespectful experts in this forum. -- Michal J. Figurski

Gustaf Rydevik wrote:

figurski.df <- data.frame(name = 1:109, num1 = rnorm(109), num2 = rnorm(109),
                          num3 = rnorm(109), outcome = sample(c(1, 0), 109, replace = TRUE))
library(Design)
lrm(outcome ~ num1 + num2 + num3, data = figurski.df)$coef
coef <- list()
for (i in 1:100) {
  tempData <- figurski.df[sample(1:109, replace = TRUE), ]
  coef[[i]] <- lrm(outcome ~ num1 + num2 + num3, data = tempData)$coef
}
coef.df <- data.frame(do.call(rbind, coef))
median(coef.df$num1)
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
It seems you have accidentally hit a surgeons' mailing list, where all you wanted was some advice on how to use this scalpel on your body. Sorry if we can't be of any help without intimidating you with unrelated and pompous terms -- like coagulation. Michal J. Figurski [EMAIL PROTECTED] [Wed, Jul 23, 2008 at 04:54:36AM CEST]: Dear all, Since you guys are frank, let me be frank as well. I did not ask anyone to impose on me their point of view on bootstrap. It's my impression that this is what you guys are trying to do - that's sad. Some of your emails in this discussion are worth less than junk mail - particularly the ones from Mr Harold Doran. It's even more sad that you use junior members of this forum to make fun and intimidate. Apparently, even with all your expertise and education in this area, many of you - experts - do not understand what I am talking about. You seem to be so much affixed to your expertise that you can't see anything beyond it. -- Johannes Hüsing mailto:[EMAIL PROTECTED] http://derwisch.wikidot.com "There is something fascinating about science. One gets such wholesale returns of conjecture from such a trifling investment of fact." (Mark Twain, Life on the Mississippi)
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski [EMAIL PROTECTED] wrote: Gustaf, I am sorry, but I don't get the point. Let's just focus on predictive performance from the cited passage, that is, the number of values predicted within 15% of the original value. So, the predictive performance of the model fit on the entire dataset was 56% of profiles, while that of the bootstrapped model was 82% of profiles. Well - I see a stunning purpose in the bootstrap step here: it turns a useless equation into a clinically applicable model! Honestly, I also can't see how this can be better than fitting on the entire dataset, but here you have a proof that it is. I think that another argument supporting this approach is model validation. If you fit the model on the entire data, you have no data left to validate its predictions. On the other hand, I agree with you that the passage in the methods section looks awkward. In my work on a similar problem, which is going to appear in August in Ther Drug Monit, I used medians from the beginning, and all the comparisons were done based on models with median coefficients. I think this is what the authors of that paper did, though they might just have had a problem with describing it correctly, and unfortunately it passed through the review process unchanged.

Hi, I believe that you misunderstand the passage. Do you know what multiple stepwise regression is? Since they used SPSS, I copied from http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm : Stepwise selection is a combination of the forward and backward procedures. Step 1: The first predictor variable is selected in the same way as in forward selection. If the probability associated with the test of significance is less than or equal to the default .05, the predictor variable with the largest correlation with the criterion variable enters the equation first. Step 2: The second variable is selected based on the highest partial correlation. If it can pass the entry requirement (PIN=.05), it also enters the equation. Step 3: From this point, stepwise selection differs from forward selection: the variables already in the equation are examined for removal according to the removal criterion (POUT=.10), as in backward elimination. Step 4: Variables not in the equation are examined for entry. Variable selection ends when no more variables meet the entry and removal criteria. --- It is the outcome of this *entire process*, steps 1-4, that they compare with the outcome of their *entire bootstrap/cross-validation/selection process*, steps 1-4 in the methods section, and find that their approach gives a better result. What you are doing is only step 4 in the article's method section: estimating the parameters of a model *when you already know which variables to include*. It is the way this step is conducted that I am sceptical about. Regards, Gustaf -- Gustaf Rydevik, M.Sci. tel: +46(0)703 051 451 address: Essingetorget 40, 112 66 Stockholm, SE skype: gustaf_rydevik
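For R users following along, stepwise selection analogous to the SPSS procedure quoted above can be imitated with step() (only an analogue: step() selects on AIC, not the PIN/POUT significance criteria, and the variable names here are invented for illustration):

```r
## Sketch of stepwise variable selection in R.
set.seed(7)
d <- data.frame(AUC = rnorm(50), C0 = rnorm(50), C1 = rnorm(50),
                C2 = rnorm(50), C4 = rnorm(50))   # made-up time points
full <- lm(AUC ~ C0 + C1 + C2 + C4, data = d)
null <- lm(AUC ~ 1, data = d)
chosen <- step(null, scope = formula(full), direction = "both",
               trace = 0)                         # forward + backward steps
formula(chosen)   # the variables that survived entry/removal
```

The key point stands regardless of the engine used: what gets validated must be this *whole selection process*, not just the final coefficient fit.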
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Thank you all for your words of wisdom. I am starting to get what you mean by bootstrap. Not surprisingly, it seems to be something other than what I do. The bootstrap is a tool, and I would rather compare it to a hammer than to a gun. People say a hammer is for driving nails. This situation is as if I planned to use it to break rocks. The key point is that I don't really care about the bias or variance of the mean in the model. These things are useful for statisticians; regular people (like me, also a chemist) do not understand them and have no use for them (well, now I somewhat understand). My goal is very practical: I need an equation that can predict a patient's outcome, based on some data, with maximum reliability and accuracy. I have found from the mentioned paper (and from my own experience) that re-sampling and running the regression on the re-sampled dataset multiple times does improve predictions. You have a proof of that in the paper, page 1502, and to me it is rather a stunning proof: compare 56% to 82% of correctly predicted values (correct meaning within 15% of the original value). I can understand that it's somewhat new for many of you, and some tried to discourage me from this approach (shooting my foot). This concept was devised by, I believe, Mr Michael Hale, a respectable biostatistician. It utilises the bootstrap concept of resampling, though after the recent discussion I think it should be called by another name. In addition to better predictive performance, using this concept I also get a second dataset with each iteration that can be used for validation of the model. In this approach the validation data are accumulated throughout the bootstrap, and then used at the end to calculate log residuals using the equation with median coefficients. I am sure you can question that in many ways, but to me this is as good as it gets.
To be more practical, I will ask the authors of this paper if I can post their original dataset in this forum (I have it somewhere) - if you guys think it's interesting enough. Then anyone of you could use it, follow the procedure, and criticize, if they wish. -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413

S Ellison wrote: jeez, but you've kicked up a storm! A penn'orth on the bootstrap; and since I'm a chemist, you can ignore at will. The bootstrap starts with your data and the model you developed with it. Resampling gives a fair idea of what the variance _around your current estimate_ is. But it cannot tell you how biased you are or improve your estimate, because there is no more information in your data. Toy example. Let's say I get some results from some measurement procedure, like this.

set.seed(408) # so we get the same random sample (!)
y <- rnorm(12, 5) # OK, not a very convincing measurement, but...
# Now let's add a bit of bias
y <- y + 3
mean(y) # ... is my (biased) estimate of the mean value.
# Now let's pretend I don't know the true answer OR the bias, which is
# what happens in the real world, and try bootstrapping. Let's get a
# rather generous 10000 resamples from my data:
m <- matrix(sample(y, length(y) * 10000, replace = TRUE), ncol = length(y))
# This gives me a matrix with 10000 rows, each of which is a resample
# of my 12 data.
# And now we can calculate 10000 bootstrapped means in one shot:
bs.mean <- apply(m, 1, mean) # which applies 'mean' to each row.
# We hope the variance of these things is about 1/12, 'cos we got y from
# a normal distribution with var 1 and we had 12 of them. Let's see...
var(bs.mean) # which should resemble 1/12
# and does... roughly.
# And for interest, compare with what we get directly from the data:
var(y) / 12
# which in this case was slightly further from the 'true' variance. It
# won't always be, though; that depends on the data.
# Anyway, the bootstrap variance looks about right. So ... on to bias.
# Now, where would we expect the bootstrapped mean value to be?
# At the true value, or where we started?
mean(bs.mean)
# Oh dear. It's still biased. And it looks very much like the mean of y.
# It's clearly told us nothing about the true mean.
# Bottom line: All you have is your data. Bootstrapping uses your data.
# Therefore, bootstrapping can tell you no more than you can get from
# your data. But it's still useful if you have some rather more
# complicated statistic derived from a non-linear fit, because it lets
# you get some idea of the variance. But not the bias.

This may be why some folk felt that your solution as worded (an ever-present peril, wording) was not an answer to the right question. The fitting procedure already gives you the 'best estimate' (where 'best' means max likelihood, this time), and bootstrapping really cannot improve on that. It can only start at your current 'best' and move away from it in a random direction. That can't possibly improve the estimated coefficients.
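For concreteness, the resample/fit/validate-on-the-left-out-rows procedure Michal describes could be sketched like this (my own reconstruction with made-up data and variable names, not the authors' code):

```r
## Resample, fit, keep out-of-sample rows for validation, and take
## median coefficients at the end.
set.seed(99)
d <- data.frame(auc = rnorm(50, 60, 10),
                c0 = rnorm(50), c1 = rnorm(50), c2 = rnorm(50))  # made up
B <- 100
coefs  <- matrix(NA, nrow = B, ncol = 4)
resids <- numeric(0)
for (b in 1:B) {
  idx <- sample(nrow(d), replace = TRUE)
  fit <- lm(auc ~ c0 + c1 + c2, data = d[idx, ])
  coefs[b, ] <- coef(fit)
  oob <- d[-unique(idx), ]                 # rows not drawn this round
  if (nrow(oob) > 0)
    resids <- c(resids, oob$auc - predict(fit, oob))
}
med.coef <- apply(coefs, 2, median)        # the "median model"
```

Per the argument above, the out-of-bag residuals are a fair look at prediction error, but the median model itself should not be expected to beat the fit to the full data.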
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Gustaf Rydevik wrote: On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski [EMAIL PROTECTED] wrote:

Gustaf, I am sorry, but I don't get the point. Let's just focus on predictive performance from the cited passage, that is the number of values predicted within 15% of the original value. So, the predictive performance from the model fit on the entire dataset was 56% of profiles, while from the bootstrapped model it was 82% of profiles. Well - I see a stunning purpose in the bootstrap step here: it turns a useless equation into a clinically applicable model! Honestly, I also can't see how this can be better than fitting on the entire dataset, but here you have proof that it is. I think that another argument supporting this approach is model validation. If you fit the model on the entire data, you have no data left to validate its predictions. On the other hand, I agree with you that the passage in the methods section looks awkward. In my work on a similar problem, which is going to appear in August in Ther Drug Monit, I used medians since the beginning and all the comparisons were done based on models with median coefficients. I think this is what the authors of that paper did, though they might just have had a problem with describing it correctly, and unfortunately it passed through the review process unchanged.

Hi, I believe that you misunderstand the passage. Do you know what multiple stepwise regression is? Since they used SPSS, I copied from http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm

Stepwise selection is a combination of forward and backward procedures. Step 1: The first predictor variable is selected in the same way as in forward selection. If the probability associated with the test of significance is less than or equal to the default .05, the predictor variable with the largest correlation with the criterion variable enters the equation first. Step 2: The second variable is selected based on the highest partial correlation. If it can pass the entry requirement (PIN=.05), it also enters the equation. Step 3: From this point, stepwise selection differs from forward selection: the variables already in the equation are examined for removal according to the removal criterion (POUT=.10), as in backward elimination. Step 4: Variables not in the equation are examined for entry. Variable selection ends when no more variables meet the entry and removal criteria.

--- It is the outcome of this *entire process*, Steps 1-4, that they compare with the outcome of their *entire bootstrap/crossvalidation/selection process*, Steps 1-4 in the methods section, and find that their approach gives a better result. What you are doing is only step 4 in the article's method section: estimating the parameters of a model *when you already know which variables to include*. It is the way this step is conducted that I am sceptical about. Regards, Gustaf

Perfectly stated Gustaf. This is a great example of needing to truly understand a method to be able to use it in the right context. After having read most of the paper by Pawinski et al now, there are other problems. 1. The paper nowhere uses bootstrapping. It uses repeated 2-fold cross-validation, a procedure not usually recommended. 2. The resampling procedure used in the paper treated the 50 pharmacokinetic profiles on 21 renal transplant patients as if these were from 50 patients. The cluster bootstrap should have been used instead. 3. Figure 2 showed the fitted regression line to the predicted vs. observed AUCs. It should have shown the line of identity instead. In other words, the authors allowed a subtle recalibration to creep into the analysis (and inverted the x- and y-variables in the plots). The fitted lines are far enough away from the line of identity as to show that the predicted values are not well calibrated.
The r^2 values claimed by the authors used the wrong formulas which allowed an automatic after-the-fact recalibration (new overall slope and intercept are estimated in the test dataset). Hence the achieved r^2 are misleading. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
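For readers unfamiliar with the procedure Gustaf describes, R's step() function performs an analogous combined forward/backward selection (AIC-based rather than SPSS's PIN/POUT p-value criteria, so results can differ). A minimal sketch on simulated data with made-up variable names:

```r
## Hypothetical sketch of stepwise selection in R. Note: step() uses AIC,
## while SPSS's stepwise uses p-value entry/removal criteria (PIN = .05,
## POUT = .10), so the selected variables may not agree.
set.seed(1)
d <- data.frame(y  = rbinom(50, 1, 0.5),
                x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
null.fit <- glm(y ~ 1, data = d, family = binomial)
full.fit <- glm(y ~ x1 + x2 + x3, data = d, family = binomial)
## direction = "both" mimics the forward/backward combination described above
sel <- step(null.fit, scope = formula(full.fit),
            direction = "both", trace = 0)
coef(sel)
```

As Gustaf points out, it is this whole selection process, not just the final coefficient fit, that has to be repeated inside any resampling loop for the comparison to be fair.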
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote: Thank you all for your words of wisdom. I start getting into what you mean by bootstrap. Not surprisingly, it seems to be something else than I do. The bootstrap is a tool, and I would rather compare it to a hammer than to a gun. People say that hammer is for driving nails. This situation is as if I planned to use it to break rocks. The key point is that I don't really care about the bias or variance of the mean in the model. These things are useful for statisticians; regular people (like me, also a chemist) do not understand them and have no use for them (well, now I somewhat understand). My goal is very practical: I need an equation that can predict patient's outcome, based on some data, with maximum reliability and accuracy.

My two cents: Bootstrapping (especially the optimism bootstrap, see Harrell 2001 ``Regression Modeling Strategies'') can be used to estimate how well a given model generalises. In other words, to estimate how much your model is overfitted to your data (more overfitting = less generalisable model). This in itself is not useful for getting the coefficients of a good model (which is always done through MLE), but it can be used to compare different models. As Frank Harrell mentioned, you can do penalised regression, and find the best penalty through bootstrapping. This will possibly yield a model that is less overfitted and hence more reliable in terms of being valid for an unseen sample (from the same population). Again, see Frank's book for more information about penalisation.

-- Gad Abraham Dept. CSSE and NICTA The University of Melbourne Parkville 3010, Victoria, Australia email: [EMAIL PROTECTED] web: http://www.csse.unimelb.edu.au/~gabraham
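As a concrete illustration of Gad's point, the optimism bootstrap is what validate() in the Design package computes (Design has since been superseded by 'rms'; the call shown is a sketch on simulated data with invented variable names):

```r
## Optimism-corrected validation of a logistic model (sketch).
library(rms)   # successor to the Design package used in this thread
set.seed(1)
d <- data.frame(y  = rbinom(109, 1, 0.5),
                x1 = rnorm(109), x2 = rnorm(109), x3 = rnorm(109))
## x = TRUE, y = TRUE stores the design matrix so validate() can resample
fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
## B bootstrap repetitions; reports apparent vs. optimism-corrected
## indexes (Dxy, R2, calibration slope, ...)
validate(fit, B = 200)
```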
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Dear all, I don't want to argue with anybody about words or about what bootstrap is suitable for - I know too little for that. All I need is help to get the *equation coefficients* optimized by bootstrap - either by one of the functions or by simple median. Please help, -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413

Frank E Harrell Jr wrote: Michal Figurski wrote: Frank, How does bootstrap improve on that? I don't know, but I have an idea. Since the data in my set are just a small sample of a big population, then if I use my whole dataset to obtain max likelihood estimates, these estimates may be best for this dataset, but far from ideal for the whole population.

The bootstrap, being a resampling procedure from your sample, has the same issues about the population as MLEs.

I used bootstrap to virtually increase the size of my dataset; it should result in estimates closer to those from the population - isn't that the purpose of bootstrap?

No

When I use such median coefficients on another dataset (another sample from the population), the predictions are better than using max likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

I am not a statistician and I don't feel what overfitting is, but it may be just another word for the same idea. Nevertheless, I would still like to know how I can get the coefficients for the model that gives the nearly unbiased estimates. I greatly appreciate your help.

More info in my book Regression Modeling Strategies. Frank

-- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413

Frank E Harrell Jr wrote: Michal Figurski wrote: Hello all, I am trying to optimize my logistic regression model by using bootstrap.
I was previously using SAS for this kind of task, but I am now switching to R. My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR. In SAS I have written a macro that was splitting the dataset, running LR on one half of the data and making predictions on the second half. Then it was collecting the equation coefficients from each iteration of bootstrap. Later I was just taking medians of these coefficients from all iterations, and used them as an optimal model - it really worked well!

Why not use maximum likelihood estimation, i.e., the coefficients from the original fit. How does the bootstrap improve on that?

Now I want to do the same in R. I tried to use the 'validate' or 'calibrate' functions from package Design, and I also experimented with function 'sm.binomial.bootstrap' from package sm. I tried also the function 'boot' from package boot, though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_.

validate and calibrate in Design do resampling on the rows. Resampling is mainly used to get a nearly unbiased estimate of the model performance, i.e., to correct for overfitting. Frank Harrell

Though the main point here is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'. Many thanks in advance!
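On the specific confusion about 'boot' selecting columns: the statistic function passed to boot() receives the data plus a vector of resampled row indices, and it is that function's job to subset the rows. A sketch matching the layout described above (variable names assumed from the description); per the thread's consensus, the per-coefficient medians are shown only for comparison against the plain MLE fit, not as a recommended model:

```r
## boot() resamples rows by passing an index vector to the statistic.
library(boot)
set.seed(1)
d <- data.frame(outcome  = rbinom(109, 1, 0.5),
                numeric1 = rnorm(109),
                numeric2 = rnorm(109),
                numeric3 = rnorm(109))
coef.fun <- function(data, idx) {
  ## fit the logistic model on the resampled rows only
  coef(glm(outcome ~ numeric1 + numeric2 + numeric3,
           data = data[idx, ], family = binomial))
}
b <- boot(d, coef.fun, R = 500)
apply(b$t, 2, median)   # per-coefficient medians across resamples
coef(glm(outcome ~ numeric1 + numeric2 + numeric3,
         data = d, family = binomial))   # the plain MLE fit, for comparison
```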
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
I think the answer has been given to you. If you want to continue to ignore that advice and use bootstrap for point estimates rather than the properties of those estimates (which is what bootstrap is for) then you are on your own.

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski Sent: Tuesday, July 22, 2008 9:52 AM To: r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

Dear all, I don't want to argue with anybody about words or about what bootstrap is suitable for - I know too little for that. All I need is help to get the *equation coefficients* optimized by bootstrap - either by one of the functions or by simple median. Please help, -- Michal J. Figurski [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hmm... It sounds like ideology to me. I was asking for technical help. I know what I want to do, just don't know how to do it in R. I'll go back to SAS then. Thank you. -- Michal J. Figurski

Doran, Harold wrote: I think the answer has been given to you. If you want to continue to ignore that advice and use bootstrap for point estimates rather than the properties of those estimates (which is what bootstrap is for) then you are on your own. [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Probably a good idea for you. The R help list is useful for both programming AND statistical advice for those who want it.

-Original Message- From: Michal Figurski [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 22, 2008 10:44 AM To: Doran, Harold; r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

Hmm... It sounds like ideology to me. I was asking for technical help. I know what I want to do, just don't know how to do it in R. I'll go back to SAS then. Thank you. -- Michal J. Figurski [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hi Michal, This paper by John Fox may help you to pin down what you are looking for and to perform your analyses: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf Nael

On Tue, Jul 22, 2008 at 3:51 PM, Michal Figurski [EMAIL PROTECTED] wrote: Dear all, I don't want to argue with anybody about words or about what bootstrap is suitable for - I know too little for that. All I need is help to get the *equation coefficients* optimized by bootstrap - either by one of the functions or by simple median. Please help, -- Michal J. Figurski [...]
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
The bootstrap **can** be used for bias correction. However, it may not be such a good thing to do. I quote from Efron and Tibshirani's AN INTRODUCTION TO THE BOOTSTRAP (p.138): ... bias estimation is usually interesting and worthwhile, but the exact use of a bias estimate is often problematic. Biases are harder to estimate than than standard errors... The straightforward bias correxction can be dangerous to use in practice, due to high variability in bias. Correcting the bias may cause a large increase in the standard error, which in turn results in a larger rms... Proceed at your own risk... Cheers, Bert Gunter Genentech -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski Sent: Tuesday, July 22, 2008 7:44 AM To: Doran, Harold; r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them? Hmm... It sounds like ideology to me. I was asking for technical help. I know what I want to do, just don't know how to do it in R. I'll go back to SAS then. Thank you. -- Michal J. Figurski Doran, Harold wrote: I think the answer has been given to you. If you want to continue to ignore that advice and use bootstrap for point estimates rather than the properties of those estimates (which is what bootstrap is for) then you are on your own. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski Sent: Tuesday, July 22, 2008 9:52 AM To: r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them? Dear all, I don't want to argue with anybody about words or about what bootstrap is suitable for - I know too little for that. All I need is help to get the *equation coefficients* optimized by bootstrap - either by one of the functions or by simple median. Please help, -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 
7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413
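To make the quoted caution concrete, here is a minimal R sketch of bootstrap bias estimation and the "straightforward bias correction" Efron and Tibshirani warn about; the data and statistic are illustrative, not from the thread:

```r
set.seed(1)
x <- rexp(30)              # small skewed sample
theta.hat <- mean(x)^2     # plug-in estimate of (population mean)^2; biased upward

# bootstrap bias estimate: average of resampled statistics minus the original
B <- 2000
theta.star <- replicate(B, mean(sample(x, replace = TRUE))^2)
bias <- mean(theta.star) - theta.hat

theta.corrected <- theta.hat - bias   # the straightforward bias correction
```

The corrected estimate can easily have higher variance than theta.hat itself, which is exactly the trade-off the quoted passage describes.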
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal, With all due respect, you have openly acknowledged that you don't know enough about the subject at hand. If that is the case, on what basis are you in a position to challenge the collective wisdom of those professionals who have voluntarily offered *expert* level statistical advice to you? You have erected a wall around your thinking. You may choose to use R or any other software application to Git-R-Done. But that does not make it correct. There are other methods to consider that could be used during the model building process itself, rather than on a post-hoc basis and I would specifically refer you to Frank's book, Regression Modeling Strategies: http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS Marc Schwartz on 07/22/2008 09:43 AM Michal Figurski wrote: Hmm... It sounds like ideology to me. I was asking for technical help. I know what I want to do, just don't know how to do it in R. I'll go back to SAS then. Thank you. -- Michal J. Figurski
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
install.packages('fortunes') library(fortunes) fortune(28) -Original Message- From: Marc Schwartz [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 22, 2008 1:29 PM To: Michal Figurski Cc: Doran, Harold; r-help@r-project.org; Frank E Harrell Jr; Bert Gunter Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Dear Marc and all, Thank you for all the due respect. I tried to explain as explicitly as I could what I am trying to do in my first email. I did not invent this procedure, it was already published in the paper: T. Pawinski, M. Hale, M. Korecka, W.E. Fitzsimmons, L.M. Shaw. Limited Sampling Strategy for the Estimation of Mycophenolic Acid Area under the Curve in Adult Renal Transplant Patients Treated with Concomitant Tacrolimus. Clinical Chemistry 2002(48:9), 1497-1504 I only adopted this methodology to work under SAS and now I try to do it under R, because I like R. I need practical advice because I have a practical problem, and I do not understand much of the theoretical discussion on what bootstrap is suitable for or not. Apparently I am trying to use it for something other than what the experts are used to... Honestly, I did not learn anything from this discussion so far, I am just disappointed. Though, since the discussion has already started, I'd welcome your criticism on this procedure - I just ask that you express it in human language. -- Michal J. Figurski Marc Schwartz wrote: Michal, With all due respect, you have openly acknowledged that you don't know enough about the subject at hand.
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote: Dear Marc and all, Thank you for all the due respect. I tried to explain as explicitly as I could what I am trying to do in my first email. I did not invent this procedure, it was already published in the paper: T. Pawinski, M. Hale, M. Korecka, W.E. Fitzsimmons, L.M. Shaw. Limited Sampling Strategy for the Estimation of Mycophenolic Acid Area under the Curve in Adult Renal Transplant Patients Treated with Concomitant Tacrolimus. Clinical Chemistry 2002(48:9), 1497-1504 If you send me a pdf of this paper I will be glad to take a look. Rather than an ad hoc bootstrap procedure you might look at the resistant/robust fit literature and use an objective function that spells out what is being optimized. There probably are cases where taking the median of a set of bootstrap regression coefficient estimates works well in a certain sense, but I would put my money on penalized maximum likelihood estimation. As Marc said, your attitude towards free advice is puzzling. Frank
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hello all, I am trying to optimize my logistic regression model using the bootstrap. I was previously using SAS for this kind of task, but I am now switching to R. My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR. In SAS I have written a macro that split the dataset, ran LR on one half of the data and made predictions on the second half. It then collected the equation coefficients from each iteration of the bootstrap. I then took the median of each coefficient across all iterations and used those medians as an optimal model - it really worked well! Now I want to do the same in R. I tried to use the 'validate' or 'calibrate' functions from package Design, and I also experimented with function 'sm.binomial.bootstrap' from package sm. I also tried the function 'boot' from package boot, though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_. Though the main point here is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'. Many thanks in advance! -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
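For reference, the row-resampling mechanics of the procedure described above can be sketched in a few lines of base R; the data frame and column names here are made-up stand-ins for the real 109-row dataset:

```r
set.seed(1)
# toy stand-in for the real data: three numeric predictors and a binary outcome
n <- 109
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.3 * d$x2))

B <- 200
coefs <- t(replicate(B, {
  i <- sample(nrow(d), replace = TRUE)                 # resample rows, not columns
  coef(glm(y ~ x1 + x2 + x3, data = d[i, ], family = binomial))
}))
median.coefs <- apply(coefs, 2, median)                # the "median coefficient" model
```

Whether the median-coefficient model is a good idea is exactly what the rest of this thread disputes; the sketch only shows the mechanics being asked about.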
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote: Hello all, I am trying to optimize my logistic regression model by using bootstrap. I was previously using SAS for this kind of tasks, but I am now switching to R. My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR. In SAS I have written a macro, that was splitting the dataset, running LR on one half of data and making predictions on second half. Then it was collecting the equation coefficients from each iteration of bootstrap. Later I was just taking medians of these coefficients from all iterations, and used them as an optimal model - it really worked well! Why not use maximum likelihood estimation, i.e., the coefficients from the original fit? How does the bootstrap improve on that? Now I want to do the same in R. I tried to use the 'validate' or 'calibrate' functions from package Design, and I also experimented with function 'sm.binomial.bootstrap' from package sm. I tried also the function 'boot' from package boot, though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_. validate and calibrate in Design do resampling on the rows. Resampling is mainly used to get a nearly unbiased estimate of the model performance, i.e., to correct for overfitting. Frank Harrell Though the main point here is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'. Many thanks in advance!
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
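As a sketch of the approach Frank recommends, an overfitting-corrected performance estimate can be obtained from lrm plus validate; note the Design package has since been superseded by rms, and the data below are illustrative:

```r
library(rms)   # successor to the Design package mentioned above
set.seed(1)
n <- 109
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.3 * d$x2))

# x=TRUE, y=TRUE store the data inside the fit so validate() can resample the rows
f <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
validate(f, B = 200)   # optimism-corrected indexes (Dxy, R2, calibration slope, ...)
```

The output reports each index on the original fit, its bootstrap optimism, and the corrected value, which is the "nearly unbiased estimate of model performance" Frank refers to.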
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Frank, How does bootstrap improve on that? I don't know, but I have an idea. Since the data in my set are just a small sample of a big population, then if I use my whole dataset to obtain max likelihood estimates, these estimates may be best for this dataset, but far from ideal for the whole population. I used bootstrap to virtually increase the size of my dataset; it should result in estimates closer to those from the population - isn't that the purpose of bootstrap? When I use such median coefficients on another dataset (another sample from the population), the predictions are better than using max likelihood estimates. I have already tested that and it worked! I am not a statistician and I don't feel what overfitting is, but it may be just another word for the same idea. Nevertheless, I would still like to know how I can get the coefficients for the model that gives the nearly unbiased estimates. I greatly appreciate your help. -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413 Frank E Harrell Jr wrote: Michal Figurski wrote: Hello all, I am trying to optimize my logistic regression model by using bootstrap. I was previously using SAS for this kind of tasks, but I am now switching to R. My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR. In SAS I have written a macro, that was splitting the dataset, running LR on one half of data and making predictions on second half. Then it was collecting the equation coefficients from each iteration of bootstrap. Later I was just taking medians of these coefficients from all iterations, and used them as an optimal model - it really worked well!
Why not use maximum likelihood estimation, i.e., the coefficients from the original fit? How does the bootstrap improve on that? Now I want to do the same in R. I tried to use the 'validate' or 'calibrate' functions from package Design, and I also experimented with function 'sm.binomial.bootstrap' from package sm. I tried also the function 'boot' from package boot, though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_. validate and calibrate in Design do resampling on the rows. Resampling is mainly used to get a nearly unbiased estimate of the model performance, i.e., to correct for overfitting. Frank Harrell Though the main point here is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'. Many thanks in advance!
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
I used bootstrap to virtually increase the size of my dataset, it should result in estimates more close to that from the population - isn't it the purpose of bootstrap? No, not really. The bootstrap is a resampling method for variance estimation. It is often used when there is not an easy way, or a closed form expression, for estimating the sampling variance of a statistic.
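Harold's description of the standard use, estimating the sampling variance of a statistic, looks like this with the boot package; the data and the statistic (the sample median, whose standard error has no simple closed form) are illustrative:

```r
library(boot)
set.seed(1)
x <- rnorm(50)

# the statistic function receives the data and an index vector of resampled ROWS
med <- function(data, i) median(data[i])

b <- boot(x, med, R = 999)
sd(b$t)        # bootstrap estimate of the standard error of the sample median
```

This is also the answer to the earlier complaint about boot() selecting columns: the statistic function itself is responsible for applying the index vector to the rows.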
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hi Doran, Maybe I am wrong, but I think bootstrap is a general resampling method which can be used for different purposes... Usually it works well when you do not have a representative sample set (perhaps with a limited number of samples). Therefore, I side with Michal... P.S., overfitting, in my opinion, describes a model that is quite specific to the training dataset but cannot be generalized to new samples... Thanks, --Jerry 2008/7/21 Doran, Harold [EMAIL PROTECTED]: I used bootstrap to virtually increase the size of my dataset, it should result in estimates more close to that from the population - isn't it the purpose of bootstrap? No, not really. The bootstrap is a resampling method for variance estimation. It is often used when there is not an easy way, or a closed form expression, for estimating the sampling variance of a statistic.
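Jerry's description of overfitting can be illustrated with a small simulation: a model with many predictors fit to pure noise looks good on its training half and falls back toward chance on the held-out half. Everything here is made-up demonstration data:

```r
set.seed(1)
n <- 200; p <- 20
d <- data.frame(y = rbinom(n, 1, 0.5),        # pure-noise outcome: nothing to learn
                matrix(rnorm(n * p), n))
train <- d[1:100, ]; test <- d[101:200, ]

f <- glm(y ~ ., data = train, family = binomial)
acc <- function(newdata)
  mean((predict(f, newdata, type = "response") > 0.5) == newdata$y)

acc(train)   # well above 0.5: the model memorizes its training half
acc(test)    # near 0.5: none of that apparent skill generalizes
```

The gap between the two accuracies is the optimism that resampling-based validation is designed to estimate.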
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Well, here is a good source--wikipedia. http://en.wikipedia.org/wiki/Bootstrapping_(statistics) From: Jerry [mailto:[EMAIL PROTECTED] Sent: Monday, July 21, 2008 3:56 PM To: Doran, Harold Cc: Michal Figurski; Frank E Harrell Jr; r-help@r-project.org Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote: Frank, How does bootstrap improve on that? I don't know, but I have an idea. Since the data in my set are just a small sample of a big population, then if I use my whole dataset to obtain max likelihood estimates, these estimates may be best for this dataset, but far from ideal for the whole population. The bootstrap, being a resampling procedure from your sample, has the same issues about the population as MLEs. I used bootstrap to virtually increase the size of my dataset, it should result in estimates more close to that from the population - isn't it the purpose of bootstrap? No When I use such median coefficients on another dataset (another sample from population), the predictions are better, than using max likelihood estimates. I have already tested that and it worked! Then your testing procedure is probably not valid. I am not a statistician and I don't feel what overfitting is, but it may be just another word for the same idea. Nevertheless, I would still like to know how can I get the coeffcients for the model that gives the nearly unbiased estimates. I greatly appreciate your help. More info in my book Regression Modeling Strategies. Frank -- Michal J. Figurski HUP, Pathology Laboratory Medicine Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413 Frank E Harrell Jr wrote: Michal Figurski wrote: Hello all, I am trying to optimize my logistic regression model by using bootstrap. I was previously using SAS for this kind of tasks, but I am now switching to R. My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR. In SAS I have written a macro, that was splitting the dataset, running LR on one half of data and making predictions on second half. 
Then it was collecting the equation coefficients from each iteration of bootstrap. Later I was just taking medians of these coefficients from all iterations, and used them as an optimal model - it really worked well! Why not use maximum likelihood estimation, i.e., the coefficients from the original fit? How does the bootstrap improve on that? Now I want to do the same in R. I tried to use the 'validate' or 'calibrate' functions from package Design, and I also experimented with function 'sm.binomial.bootstrap' from package sm. I tried also the function 'boot' from package boot, though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_. validate and calibrate in Design do resampling on the rows. Resampling is mainly used to get a nearly unbiased estimate of the model performance, i.e., to correct for overfitting. Frank Harrell Though the main point here is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'. Many thanks in advance! -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University