Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-08-01 Thread Michal Figurski

Dear all,

Your persistent discussion of what the bootstrap is and is not suitable for 
finally made me verify the findings in the Pawinski et al paper.


Here is the procedure and the findings:
 - First of all I took the raw data (that was posted earlier on this 
list) and estimated the AUC values using the equation coefficients of their 
recommended model (#10). However, I was _unable to reproduce_ either the 
r^2 or the predictive performance values. My results are 0.74 and 44%, 
respectively, while the reported figures were 0.862 and 82% (41 profiles 
out of 50). My scatterplot also looks different than the Fig.2 model 10 
scatterplot. Weird...
 - Then, I fit the multiple linear model to the whole dataset (no 
bootstrap), using the time-points of model #10. I obtained r^2 of 0.74 
(agreement), mean prediction error of 7.4% +-28.3% and predictive 
performance of 44%. The mean reported prediction error (PE) was 7.6% 
+-26.7% and predictive performance: 56% (page 1502, second column, 
sentence 2nd from top)! I think the difference in PE may be attributed 
to numerical differences between SPSS and R, though I can't explain the 
difference in predictive performance.
 - Finally, I used Gustaf's bootstrap code to fit linear regression 
with model #10 time-points on the resampled dataset. The r^2 of the 
model with median coefficients was identical to that of the model fit to 
entire data, and the predictive performance was better by only one 
profile in the range: 46%. As you see, these figures are very far from 
the numbers reported in the paper. I will be in discussion with the 
authors on how they obtained these numbers, but I have doubts about 
whether this paper is valid at all...
 - Later I tested it on my own dataset (paper to appear in August), and 
found that the MLR model fit on entire data has identical r^2 and 
predictive performance as the median coefficient model from bootstrap.
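
The last check above is easy to reproduce on synthetic data. This is not 
the actual SPSS/SAS/R code from the paper - just a minimal pure-Python 
sketch of why the median-coefficient model from resampling ends up 
matching the fit on the entire dataset:

```python
import random
import statistics

def ols_fit(xs, ys):
    """Simple least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

random.seed(42)
# Synthetic data: y = 2 + 3x + noise
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x + random.gauss(0, 0.5) for x in xs]

# Fit once on the entire dataset
a_full, b_full = ols_fit(xs, ys)

# Refit on 100 resampled datasets and take the median coefficients
intercepts, slopes = [], []
for _ in range(100):
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    a, b = ols_fit([xs[i] for i in idx], [ys[i] for i in idx])
    intercepts.append(a)
    slopes.append(b)

a_med = statistics.median(intercepts)
b_med = statistics.median(slopes)

# The median-coefficient model is essentially the full-data model
print(abs(a_med - a_full) < 0.2, abs(b_med - b_full) < 0.2)
```

The median of the resampled coefficients just recenters on the full-data 
estimate, which is why the two models predict identically.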


I must admit, guys, *that I was wrong and you were right: this 
bootstrap-like procedure does not improve predictions* - at least not to 
the extent reported in the Pawinski et al paper.


I blindly believed in this paper, and I am somewhat embarrassed that I 
didn't verify these findings, even though their dataset had been 
available to me from the beginning. Maybe it was too much trust in the 
printed word and in the authority of the PhD biostatistician who devised 
the procedure...


Nevertheless, I am happy that at least this procedure is harmless, and 
that I can reproduce the figures reported in /my/ paper.


Best regards, and apologies for being such a hard student. I am being 
converted to orthodox statistics.


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Gustaf Rydevik wrote:

On Thu, Jul 31, 2008 at 4:30 PM, Michal Figurski
[EMAIL PROTECTED] wrote:

Frank and all,

The point you were looking for was in a page that was linked from the
referenced page - I apologize for confusion. Please take a look at the two
last paragraphs here:
http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm

Though, possibly it's my ignorance, maybe it's yours, but you actually
missed the important point again. It is that you just don't estimate mean,
or CI, or variance on PK profile data! It is as if you were trying to
estimate mean, CI and variance of a Toccata__Fugue_in_D_minor.wav file.
What for? The point is in the music! Would the mean or CI or variance tell
you anything about that? Besides, everybody knows the variance (or
variability?) is there and can estimate it without spending time on
calculations.
What I am trying to do is comparable to compressing a wave into mp3 - to
predict the wave using as few data points as possible. I have a bunch of
similar waves and I'm trying to find a common equation to predict them all.
I am *not* looking for the variance of the mean!

I could be wrong (though it seems less and less likely), but you keep
talking about the same irrelevant parameters (CI, variance) on and on. Well,
yes - we are at a standstill, but not because of Davison & Hinkley's book. I
can try reading it, though as I stated above, it is not even remotely
related to what I am trying to do. I'll skip it then - life is too short.

Nevertheless I thank you (all) for relevant criticism on the procedure (in
the points where it was relevant). I plan to use this methodology further,
and it was good to find out that it withstood your criticism. I will look
into the penalized methods, though.

--
Michal J. Figurski



I take it you mean the sentence:

 For example, in here, the statistical estimator is  the sample mean.
Using bootstrap sampling, you can do beyond your statistical
estimators. You can now get even the distribution of your estimator
and the statistics (such as confidence interval, variance) of your
estimator.

Again you are misinterpreting text. The phrase about doing beyond
your statistical estimators, is explained in the next sentence, where
he says that using bootstrap gives you information about the mean
*estimator* (and not more information about the population mean).

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-31 Thread Michal Figurski

Frank and all,

The point you were looking for was in a page that was linked from the 
referenced page - I apologize for confusion. Please take a look at the 
two last paragraphs here: 
http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm


Though, possibly it's my ignorance, maybe it's yours, but you actually 
missed the important point again. It is that you just don't estimate 
mean, or CI, or variance on PK profile data! It is as if you were trying 
to estimate mean, CI and variance of a Toccata__Fugue_in_D_minor.wav 
file. What for? The point is in the music! Would the mean or CI or 
variance tell you anything about that? Besides, everybody knows the 
variance (or variability?) is there and can estimate it without spending 
time on calculations.
What I am trying to do is comparable to compressing a wave into mp3 - to 
predict the wave using as few data points as possible. I have a bunch of 
similar waves and I'm trying to find a common equation to predict them 
all. I am *not* looking for the variance of the mean!


I could be wrong (though it seems less and less likely), but you keep 
talking about the same irrelevant parameters (CI, variance) on and on. 
Well, yes - we are at a standstill, but not because of Davison & 
Hinkley's book. I can try reading it, though as I stated above, it is 
not even remotely related to what I am trying to do. I'll skip it then 
- life is too short.


Nevertheless I thank you (all) for relevant criticism on the procedure 
(in the points where it was relevant). I plan to use this methodology 
further, and it was good to find out that it withstood your criticism. I 
will look into the penalized methods, though.


--
Michal J. Figurski


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Tim,

If I understand correctly, you are saying that one can't improve on 
estimating a mean by doing bootstrap and summarizing means of many 
such steps. As far as I understand (again), you're saying that this 
way one can only add bias without any improvement...


Well, this is in contradiction to some guides to bootstrap, that I 
found on the web (I did my homework), for example to this one:
http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap 
Statistic Mean.htm


Where on that web site does it state anything that is remotely related 
to your point?  It shows how to use the bootstrap to estimate the bias, 
does not show that the bias is important (it isn't; the simulation is 
from a normal distribution and the sample mean is perfectly unbiased; 
you are just seeing sampling error in the bias estimate).




It is all confusing, guys... Somebody once said that there are as 
many opinions on a topic as there are statisticians...


Also, translating your statements into the example of hammer and rock, 
you are saying that one cannot use a hammer to break rocks because it 
was created to drive nails.


With all respect, despite my limited knowledge, I do not agree.
The big point is that the mean, or standard error, or confidence 
intervals of the data itself are *meaningless* in the pharmacokinetic 
dataset. These data are time series of a highly variable quantity, 
that is known to display a peak (or two in the case of Pawinski's 
paper). It is as if you tried to calculate a mean of a chromatogram 
(example for chemists, sorry).


Nevertheless, I thank all of you, experts, for your insight and 
advice. In the end, I learned a lot, though I keep my initial view. 
Summarizing your criticism of the procedure described in Pawinski's 
paper:


If you think that you can learn statistics easily when I would have a 
devil of a time learning chemistry, and if you are not willing to read 
for example the Davison and Hinkley bootstrap text, I guess we are at a 
standstill.


Frank Harrell

 - Some of you say that this isn't bootstrap at all. In terms of 
terminology I totally submit to that, because I know too little. Would 
anyone suggest a name?
 - Most of you say that this procedure is not the best one, that there 
are better ways. I will definitely do my homework on penalized 
regression, though none of you has actually discredited this 
methodology. Therefore, though possibly not optimal, it remains valid.
 - The criticism of predictive performance is that one also has to 
take into account other important quantities, like bias, variance, 
etc. Fortunately I did that in my work: using RMSE and log residuals 
from the validation process. I just observed that models with 
relatively small RMSE and log residuals (compared to other models) 
usually possess good predictive performance. And vice versa.
Predictive performance also has a great advantage over RMSE or 
variance or anything else suggested here - it is easily understood by 
non-statisticians. I don't think it is /too simple/ in Einstein's 
terms, it's just simple.


Kind regards,

--
Michal J. Figurski


Tim Hesterberg wrote:

I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to virtually
increase the size of the sample.

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-31 Thread Gustaf Rydevik
On Thu, Jul 31, 2008 at 4:30 PM, Michal Figurski
[EMAIL PROTECTED] wrote:
 Frank and all,

 The point you were looking for was in a page that was linked from the
 referenced page - I apologize for confusion. Please take a look at the two
 last paragraphs here:
 http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm

 Though, possibly it's my ignorance, maybe it's yours, but you actually
 missed the important point again. It is that you just don't estimate mean,
 or CI, or variance on PK profile data! It is as if you were trying to
 estimate mean, CI and variance of a Toccata__Fugue_in_D_minor.wav file.
 What for? The point is in the music! Would the mean or CI or variance tell
 you anything about that? Besides, everybody knows the variance (or
 variability?) is there and can estimate it without spending time on
 calculations.
 What I am trying to do is comparable to compressing a wave into mp3 - to
 predict the wave using as few data points as possible. I have a bunch of
 similar waves and I'm trying to find a common equation to predict them all.
 I am *not* looking for the variance of the mean!

 I could be wrong (though it seems less and less likely), but you keep
 talking about the same irrelevant parameters (CI, variance) on and on. Well,
 yes - we are at a standstill, but not because of Davison & Hinkley's book. I
 can try reading it, though as I stated above, it is not even remotely
 related to what I am trying to do. I'll skip it then - life is too short.

 Nevertheless I thank you (all) for relevant criticism on the procedure (in
 the points where it was relevant). I plan to use this methodology further,
 and it was good to find out that it withstood your criticism. I will look
 into the penalized methods, though.

 --
 Michal J. Figurski


I take it you mean the sentence:

 For example, in here, the statistical estimator is  the sample mean.
Using bootstrap sampling, you can do beyond your statistical
estimators. You can now get even the distribution of your estimator
and the statistics (such as confidence interval, variance) of your
estimator.

Again you are misinterpreting text. The phrase about doing beyond
your statistical estimators, is explained in the next sentence, where
he says that using bootstrap gives you information about the mean
*estimator* (and not more information about the population mean).
And since you're not interested in this information, in your case
bootstrap/resampling is not useful at all.

As another example of misinterpretation: In your email from  a week
ago, it sounds like you believe that the authors of the original paper
are trying to improve on a fixed model
Figurski:
Regarding the multiple stepwise regression - according to the cited
SPSS manual, there are 5 options to select from. I don't think they used
'stepwise selection' option, because their models were already
pre-defined. Variables were pre-selected based on knowledge of
pharmacokinetics of this drug and other factors. I think this part I
understand pretty well.

This paragraph is wrong. Sorry, no way around it.

Quoting from the paper Pawinski etal:
  *__Twenty-six(!)* 1-, 2-, or 3-sample estimation
models were fit (r2 = 0.341–0.862) to a randomly
selected subset of the profiles using linear regression
and were used to estimate AUC0–12h for the profiles not
included in the regression fit, comparing those estimates
with the corresponding AUC0–12h values, calculated
with the linear trapezoidal rule, including all 12
timed MPA concentrations. The 3-sample models were
constrained to include no samples past 2 h.
(emph. mine)

They clearly state that they are choosing among 26 different models by
using their bootstrap-like procedure, not improving on a single,
predefined model.
This procedure is statistically sound (more or less at least), and not
controversial.
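
Picking among candidate models by repeated random train/test splits can 
be sketched roughly as follows. Everything here - the synthetic data, the 
two candidate one-predictor models, and the 40/20 split repeated 50 times 
- is invented for illustration, not taken from the paper:

```python
import random

random.seed(0)

# Hypothetical dataset: outcome y and two candidate predictors,
# one informative (x1) and one pure noise (x2)
n = 60
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * a + random.gauss(0, 1) for a in x1]

def fit_and_score(x, y, train, test):
    """Fit y = a + b*x on the training indices, return MSE on the test indices."""
    tx = [x[i] for i in train]
    ty = [y[i] for i in train]
    mx = sum(tx) / len(tx)
    my = sum(ty) / len(ty)
    b = sum((u - mx) * (v - my) for u, v in zip(tx, ty)) \
        / sum((u - mx) ** 2 for u in tx)
    a = my - b * mx
    return sum((y[i] - (a + b * x[i])) ** 2 for i in test) / len(test)

# Repeat the random split 50 times; average each candidate's test error
scores = {"model_x1": 0.0, "model_x2": 0.0}
for _ in range(50):
    idx = list(range(n))
    random.shuffle(idx)
    train, test = idx[:40], idx[40:]
    scores["model_x1"] += fit_and_score(x1, y, train, test) / 50
    scores["model_x2"] += fit_and_score(x2, y, train, test) / 50

# Choose the model with the smallest average out-of-sample error
best = min(scores, key=scores.get)
print(best)
```

The point of the repetition is only to stabilize the comparison between 
candidate models; it does not improve any single model's fit.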

However, (again) what you are wanting to do is *not* what they did in
their paper!
Resampling cannot improve on the performance of a pre-specified
model. This is intuitively obvious, and moreover it's mathematically
provable! That's why we're so certain of our standpoint. If you really
wish, I (or someone else) could write out a proof, but I'm unsure if
you would be able to follow.

In the end, it doesn't really matter. What you are doing amounts to
doing a regression 50 times, when once would suffice. No big harm
done, just a bit of unnecessary work. And proof to a statistically
competent reviewer that you don't really understand what you're doing.
The better option would be to either study some more statistics
yourself, or find a statistician that can do your analysis for you,
and trust him to do it right.

Anyhow, good luck with your research.

Best regards,

Gustaf

-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-31 Thread Michal Figurski

Gustaf,

Summarizing things I don't understand:
 - Honestly, I was thinking I can use bootstrap to obtain better
estimate of a mean - provided that I want it. So, I can't?
 - If I can't obtain reliable estimates of CI and variance from a small
dataset, but I can do it with bootstrap - isn't that a virtual increase
in the size of the dataset? OK, these are just words, I won't fight over that.
 - I don't understand why a procedure works for 26 models and doesn't
work for one... Intuitively this doesn't make sense...
 - I don't understand why resampling *cannot* improve... while it does?
I know the proof is going to be hard to follow, but let me try! (The
proof of the opposite is in the paper).
 - I truly don't understand what I don't understand about what I am
doing. This is getting too convoluted for me...

And a remark about what I don't agree with Gustaf:

The text below, quoted from Pawinski et al (Twenty-six...), is missing
an important piece of information - that they repeated that step 50 times,
each time with a randomly selected subset. Excuse my ignorance again, but
this looks like bootstrap (re-sampling), doesn't it? Although I won't
argue for names.

I want to assure everyone here that I did *exactly* what they did. I
work in the same lab that this paper came from, and I just had their
procedure in SPSS translated to SAS. Moreover, the translation was done
with the help of a _trustworthy biostatistician_ - I was not that good with
SAS at the time to do it myself. The biostatistician wrote the
randomization and regression subroutines. I later improved them using
macros (less code) and added validation part. It was then approved by
that biostatistician.
OK, I did not do exactly the same thing, because I repeated the step 100 times
for 34 *pre-defined* models and on a different dataset. But that's about
all the difference.

I hope this solves everyone's dilemma whether I did what is described in
Pawinski's paper or not.

This discussion, though, started with my question on: how to do it in R,
instead of SAS, and with logistic (not linear) regression. Thank you,
Gustaf, for the code - this was the help I needed.
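
Gustaf's actual code is not reproduced in this digest. As a rough 
stand-in for the general idea - refitting a logistic regression on 
resampled data and collecting the coefficients - here is a minimal 
pure-Python sketch. The synthetic data, the gradient-ascent fitter, and 
the number of resamples are all illustrative assumptions:

```python
import math
import random
import statistics

def logit_fit(xs, ys, lr=0.5, steps=600):
    """Fit P(y=1) = 1/(1+exp(-(a+b*x))) by plain gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (y - p) / n
            gb += (y - p) * x / n
        a += lr * ga
        b += lr * gb
    return a, b

random.seed(1)
# Synthetic binary outcomes with true coefficients a=0.5, b=1.5
xs = [random.gauss(0, 1) for _ in range(150)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-(0.5 + 1.5 * x))) else 0
      for x in xs]

# Refit on resampled copies of the data and collect the coefficients
coefs = []
for _ in range(15):
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    coefs.append(logit_fit([xs[i] for i in idx], [ys[i] for i in idx]))

# Median coefficients across the resampled fits
a_med = statistics.median(c[0] for c in coefs)
b_med = statistics.median(c[1] for c in coefs)
print(a_med, b_med)
```

In R one would simply refit `glm(..., family = binomial)` inside the 
resampling loop and extract `coef()` each time; the structure is the same.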

--
Michal J. Figurski


Gustaf Rydevik wrote:


 For example, in here, the statistical estimator is  the sample mean.
Using bootstrap sampling, you can do beyond your statistical
estimators. You can now get even the distribution of your estimator
and the statistics (such as confidence interval, variance) of your
estimator.

Again you are misinterpreting text. The phrase about doing beyond
your statistical estimators, is explained in the next sentence, where
he says that using bootstrap gives you information about the mean
*estimator* (and not more information about the population mean).
And since you're not interested in this information, in your case
bootstrap/resampling is not useful at all.

As another example of misinterpretation: In your email from  a week
ago, it sounds like you believe that the authors of the original paper
are trying to improve on a fixed model
Figurski:
Regarding the multiple stepwise regression - according to the cited
SPSS manual, there are 5 options to select from. I don't think they used
'stepwise selection' option, because their models were already
pre-defined. Variables were pre-selected based on knowledge of
pharmacokinetics of this drug and other factors. I think this part I
understand pretty well.

This paragraph is wrong. Sorry, no way around it.

Quoting from the paper Pawinski etal:
  *__Twenty-six(!)* 1-, 2-, or 3-sample estimation
models were fit (r2 = 0.341–0.862) to a randomly
selected subset of the profiles using linear regression
and were used to estimate AUC0–12h for the profiles not
included in the regression fit, comparing those estimates
with the corresponding AUC0–12h values, calculated
with the linear trapezoidal rule, including all 12
timed MPA concentrations. The 3-sample models were
constrained to include no samples past 2 h.
(emph. mine)

They clearly state that they are choosing among 26 different models by
using their bootstrap-like procedure, not improving on a single,
predefined model.
This procedure is statistically sound (more or less at least), and not
controversial.

However, (again) what you are wanting to do is *not* what they did in
their paper!

Resampling cannot improve on the performance of a pre-specified
model. This is intuitively obvious, and moreover it's mathematically
provable! That's why we're so certain of our standpoint. If you really
wish, I (or someone else) could write out a proof, but I'm unsure if
you would be able to follow.

In the end, it doesn't really matter. What you are doing amounts to
doing a regression 50 times, when once would suffice. No big harm
done, just a bit of unnecessary work. And proof to a statistically
competent reviewer that you don't really understand what you're doing.
The better option would be to either study some more statistics
yourself, or find a statistician that can do your analysis for you, 
and trust him to do it right.

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-30 Thread Michal Figurski

Tim,

If I understand correctly, you are saying that one can't improve on 
estimating a mean by doing bootstrap and summarizing means of many such 
steps. As far as I understand (again), you're saying that this way one 
can only add bias without any improvement...


Well, this is in contradiction to some guides to bootstrap, that I found 
on the web (I did my homework), for example to this one:
http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap 
Statistic Mean.htm


It is all confusing, guys... Somebody once said that there are as many 
opinions on a topic as there are statisticians...


Also, translating your statements into the example of hammer and rock, 
you are saying that one cannot use a hammer to break rocks because it was 
created to drive nails.


With all respect, despite my limited knowledge, I do not agree.
The big point is that the mean, or standard error, or confidence 
intervals of the data itself are *meaningless* in the pharmacokinetic 
dataset. These data are time series of a highly variable quantity, that 
is known to display a peak (or two in the case of Pawinski's paper). It 
is as if you tried to calculate a mean of a chromatogram (example for 
chemists, sorry).


Nevertheless, I thank all of you, experts, for your insight and advice. 
In the end, I learned a lot, though I keep my initial view. Summarizing 
your criticism of the procedure described in Pawinski's paper:
 - Some of you say that this isn't bootstrap at all. In terms of 
terminology I totally submit to that, because I know too little. Would 
anyone suggest a name?
 - Most of you say that this procedure is not the best one, that there 
are better ways. I will definitely do my homework on penalized 
regression, though none of you has actually discredited this 
methodology. Therefore, though possibly not optimal, it remains valid.
 - The criticism of predictive performance is that one also has to 
take into account other important quantities, like bias, variance, etc. 
Fortunately I did that in my work: using RMSE and log residuals from the 
validation process. I just observed that models with relatively small 
RMSE and log residuals (compared to other models) usually possess good 
predictive performance. And vice versa.
Predictive performance also has a great advantage over RMSE or variance 
or anything else suggested here - it is easily understood by 
non-statisticians. I don't think it is /too simple/ in Einstein's terms, 
it's just simple.
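
For what it's worth, predictive performance in this sense is trivial to 
compute: the percentage of profiles whose predicted value falls within a 
tolerance of the observed one. The ±15% tolerance and the numbers below 
are illustrative assumptions, not values from the paper:

```python
# Percentage of profiles whose predicted AUC falls within ±15% of observed
observed = [52.1, 60.4, 48.0, 71.3, 55.6, 64.2, 49.9, 58.8]
predicted = [50.0, 75.0, 47.1, 69.0, 40.2, 66.0, 51.3, 60.1]

within = sum(1 for o, p in zip(observed, predicted)
             if abs(p - o) / o <= 0.15)
performance = 100 * within / len(observed)
print(f"{performance:.0f}%")
```

This single number is indeed easy for non-statisticians to read, which 
is exactly the appeal (and, per the critics, the limitation).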


Kind regards,

--
Michal J. Figurski


Tim Hesterberg wrote:

I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to virtually
increase the size of the sample.

Short answer - no, with some exceptions (bumping / Random Forests).

Longer answer:
Suppose you have data (x1, ..., xn) and a statistic ThetaHat,
that you take a number of bootstrap samples (all of size n) and
let ThetaHatBar be the average of those bootstrap statistics from
those samples.

Is ThetaHatBar better than ThetaHat?  Usually not.  Usually it
is worse.  You have not collected any new data, you are just using the
existing data in a different way, that is usually harmful:
* If the statistic is the sample mean, all this does is to add
  some noise to the estimate
* If the statistic is nonlinear, this gives an estimate that
  has roughly double the bias, without improving the variance.

What are the exceptions?  The prime example is tree models (random
forests) - taking bootstrap averages helps smooth out the
discontinuities in tree models.  For a simple example, suppose that a
simple linear regression model really holds:
y = beta x + epsilon
but that you fit a tree model; the tree model predictions are
a step function.  If you bootstrap the data, the boundaries of
the step function will differ from one sample to another, so
the average of the bootstrap samples smears out the steps, getting
closer to the smooth linear relationship.

Aside from such exceptions, the bootstrap is used for inference
(bias, standard error, confidence intervals), not improving on
ThetaHat.

Tim Hesterberg


Hi Doran,

Maybe I am wrong, but I think bootstrap is a general resampling method which
can be used for different purposes...Usually it works well when you do not
have a representative sample set (maybe with a limited number of samples).
Therefore, I am positive with Michal...

P.S.: overfitting, in my opinion, describes a model that is quite 
specific to the training dataset but cannot be generalized 
to new samples.

Thanks,

--Jerry
2008/7/21 Doran, Harold [EMAIL PROTECTED]:


I used bootstrap to virtually increase the size of my
dataset, it should result in estimates more close to that
from the population - isn't it the purpose of bootstrap?

No, not really. The bootstrap is a resampling method for variance
estimation. It is often used when there is not an easy way, or a closed
form expression, for estimating the sampling variance of a statistic.

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-30 Thread Frank E Harrell Jr

Michal Figurski wrote:

Tim,

If I understand correctly, you are saying that one can't improve on 
estimating a mean by doing bootstrap and summarizing means of many such 
steps. As far as I understand (again), you're saying that this way one 
can only add bias without any improvement...


Well, this is in contradiction to some guides to bootstrap, that I found 
on the web (I did my homework), for example to this one:
http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap 
Statistic Mean.htm


Where on that web site does it state anything that is remotely related 
to your point?  It shows how to use the bootstrap to estimate the bias, 
does not show that the bias is important (it isn't; the simulation is 
from a normal distribution and the sample mean is perfectly unbiased; 
you are just seeing sampling error in the bias estimate).




It is all confusing, guys... Somebody once said that there are as many 
opinions on a topic as there are statisticians...


Also, translating your statements into the example of hammer and rock, 
you are saying that one cannot use a hammer to break rocks because it was 
created to drive nails.


With all respect, despite my limited knowledge, I do not agree.
The big point is that the mean, or standard error, or confidence 
intervals of the data itself are *meaningless* in the pharmacokinetic 
dataset. These data are time series of a highly variable quantity, that 
is known to display a peak (or two in the case of Pawinski's paper). It 
is as if you tried to calculate a mean of a chromatogram (example for 
chemists, sorry).


Nevertheless, I thank all of you, experts, for your insight and advice. 
In the end, I learned a lot, though I keep my initial view. Summarizing 
your criticism of the procedure described in Pawinski's paper:


If you think that you can learn statistics easily when I would have a 
devil of a time learning chemistry, and if you are not willing to read 
for example the Davison and Hinkley bootstrap text, I guess we are at a 
standstill.


Frank Harrell

 - Some of you say that this isn't bootstrap at all. In terms of 
terminology I totally submit to that, because I know too little. Would 
anyone suggest a name?
 - Most of you say that this procedure is not the best one, that there 
are better ways. I will definitely do my homework on penalized 
regression, though none of you has actually discredited this 
methodology. Therefore, though possibly not optimal, it remains valid.
 - The criticism of predictive performance is that one also has to 
take into account other important quantities, like bias, variance, etc. 
Fortunately I did that in my work: using RMSE and log residuals from the 
validation process. I just observed that models with relatively small 
RMSE and log residuals (compared to other models) usually possess good 
predictive performance. And vice versa.
Predictive performance also has a great advantage over RMSE or variance 
or anything else suggested here - it is easily understood by 
non-statisticians. I don't think it is /too simple/ in Einstein's terms, 
it's just simple.


Kind regards,

--
Michal J. Figurski


Tim Hesterberg wrote:

I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to virtually
increase the size of the sample.

Short answer - no, with some exceptions (bumping / Random Forests).

Longer answer:
Suppose you have data (x1, ..., xn) and a statistic ThetaHat,
that you take a number of bootstrap samples (all of size n) and
let ThetaHatBar be the average of those bootstrap statistics from
those samples.

Is ThetaHatBar better than ThetaHat?  Usually not.  Usually it
is worse.  You have not collected any new data, you are just using the
existing data in a different way, that is usually harmful:
* If the statistic is the sample mean, all this does is to add
  some noise to the estimate
* If the statistic is nonlinear, this gives an estimate that
  has roughly double the bias, without improving the variance.

What are the exceptions?  The prime example is tree models (random
forests) - taking bootstrap averages helps smooth out the
discontinuities in tree models.  For a simple example, suppose that a
simple linear regression model really holds:
y = beta x + epsilon
but that you fit a tree model; the tree model predictions are
a step function.  If you bootstrap the data, the boundaries of
the step function will differ from one sample to another, so
the average of the bootstrap samples smears out the steps, getting
closer to the smooth linear relationship.

Aside from such exceptions, the bootstrap is used for inference
(bias, standard error, confidence intervals), not improving on
ThetaHat.

Tim Hesterberg


Hi Doran,

Maybe I am wrong, but I think bootstrap is a general resampling 
method which
can be used for different purposes...Usually it works well when you 
do not

have a representative sample set (maybe with a limited number of samples).

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-27 Thread Tim Hesterberg
I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to virtually
increase the size of the sample.

Short answer - no, with some exceptions (bumping / Random Forests).

Longer answer:
Suppose you have data (x1, ..., xn) and a statistic ThetaHat,
that you take a number of bootstrap samples (all of size n) and
let ThetaHatBar be the average of those bootstrap statistics from
those samples.

Is ThetaHatBar better than ThetaHat?  Usually not.  Usually it
is worse.  You have not collected any new data, you are just using the
existing data in a different way, that is usually harmful:
* If the statistic is the sample mean, all this does is to add
  some noise to the estimate
* If the statistic is nonlinear, this gives an estimate that
  has roughly double the bias, without improving the variance.
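
The first bullet is easy to check numerically. A small sketch with 
synthetic data (the sample size, seed, and number of resamples are 
arbitrary):

```python
import random
import statistics

random.seed(7)
data = [random.gauss(10, 2) for _ in range(30)]

# ThetaHat: the plain sample mean
theta_hat = statistics.mean(data)

# ThetaHatBar: the average of B bootstrap sample means
B = 500
boot_means = []
for _ in range(B):
    resample = [random.choice(data) for _ in data]
    boot_means.append(statistics.mean(resample))
theta_hat_bar = statistics.mean(boot_means)

# ThetaHatBar is, up to Monte Carlo noise, just ThetaHat again:
# averaging resampled means recovers the original mean plus extra noise.
print(abs(theta_hat_bar - theta_hat) < 0.1)
```

No new information about the population appears; only simulation noise 
is added on top of the original estimate.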

What are the exceptions?  The prime example is tree models (random
forests) - taking bootstrap averages helps smooth out the
discontinuities in tree models.  For a simple example, suppose that a
simple linear regression model really holds:
y = beta x + epsilon
but that you fit a tree model; the tree model predictions are
a step function.  If you bootstrap the data, the boundaries of
the step function will differ from one sample to another, so
the average of the bootstrap samples smears out the steps, getting
closer to the smooth linear relationship.
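Tim's point about tree models can be illustrated with a short R sketch. This is hypothetical data of my own, using the rpart package (which ships with R) rather than any specific random-forest implementation:

```r
## Hypothetical illustration: one regression tree predicts a step
## function, but averaging trees fit to bootstrap resamples ("bagging")
## smears the steps out toward the underlying linear relationship.
library(rpart)

set.seed(1)
n    <- 100
x    <- runif(n, 0, 10)
y    <- 2 * x + rnorm(n)               # true model: y = beta*x + epsilon
dat  <- data.frame(x = x, y = y)
grid <- data.frame(x = seq(0, 10, length.out = 200))

## A single tree: a step function
single <- predict(rpart(y ~ x, data = dat), grid)

## Average the predictions of trees fit to bootstrap resamples
B <- 200
preds <- replicate(B, {
  boot <- dat[sample(n, replace = TRUE), ]
  predict(rpart(y ~ x, data = boot), grid)
})
bagged <- rowMeans(preds)              # bootstrap-averaged predictions

mean(abs(single - 2 * grid$x))         # error of the single tree
mean(abs(bagged - 2 * grid$x))         # typically smaller
```

Plotting `single` and `bagged` against `grid$x` makes the smoothing visible: the bagged curve hugs the straight line where the single tree jumps in steps.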

Aside from such exceptions, the bootstrap is used for inference
(bias, standard error, confidence intervals), not improving on
ThetaHat.

Tim Hesterberg

Hi Doran,

Maybe I am wrong, but I think the bootstrap is a general resampling method which
can be used for different purposes... Usually it works well when you do not
have a representative sample set (maybe with a limited number of samples).
Therefore, I am siding with Michal...

P.S. Overfitting, in my opinion, describes the situation where you have a model
which is quite specific to the training dataset but cannot be generalized
to new samples.

Thanks,

--Jerry
2008/7/21 Doran, Harold [EMAIL PROTECTED]:

  I used the bootstrap to virtually increase the size of my
  dataset; it should result in estimates closer to those
  from the population - isn't that the purpose of the bootstrap?

 No, not really. The bootstrap is a resampling method for variance
 estimation. It is often used when there is not an easy way, or a closed
 form expression, for estimating the sampling variance of a statistic.
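Harold's point - the bootstrap as a variance-estimation device - can be sketched in a few lines of R, on hypothetical data of my own:

```r
## Bootstrap standard error of a statistic (here the sample median),
## for which no simple closed-form variance expression exists.
set.seed(42)
x <- rexp(50)                              # hypothetical sample
B <- 2000
boot.stats <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sd(boot.stats)                  # bootstrap standard error
se.boot
```

Note that the statistic itself is still computed from the original data; the resampling only tells you how variable that estimate is.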

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-24 Thread Michal Figurski

Greg and all,

Just another thought on bias and variability. As I tried to explain, I 
perceive this problem as a very practical problem.


The equation, that is the goal of this work, is supposed to serve the 
clinicians to estimate a pharmacokinetic parameter. It therefore must be 
simple and also presented in simple language, so that an average 
spreadsheet user can make use of it.


Therefore, in the end, isn't the *predictive performance* an ultimate 
measure of it all? Doesn't it account for bias and all the other stuff? 
It does tell you in how many cases you may expect to have the predicted 
value within 15% of the true value.
I apologize for my naive questions again, but aren't then the 
calculations of bias and variance, etc, just a waste of time, while you 
have it all summarized in the predictive performance?
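For concreteness, the "predictive performance" measure discussed here - the percentage of predictions falling within 15% of the observed value - can be written as a one-line R function. The name `predictive.performance` is my own, not from the paper:

```r
## Percentage of predictions falling within +/-15% of the observed value.
predictive.performance <- function(observed, predicted, tol = 0.15) {
  100 * mean(abs(predicted - observed) / abs(observed) <= tol)
}

## One of two predictions is within 15%, so this returns 50
predictive.performance(c(100, 100), c(110, 130))
```

As the statisticians point out below, this summarizes prediction accuracy on a given dataset, but it does not by itself separate bias from variance.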


--
Michal J. Figurski

Greg Snow wrote:

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
Sent: Wednesday, July 23, 2008 10:22 AM
To: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from
bootstrap - how to get them?

Thank you all for your words of wisdom.

I start getting into what you mean by bootstrap. Not
surprisingly, it seems to be something else than I do. The
bootstrap is a tool, and I would rather compare it to a
hammer than to a gun. People say that hammer is for driving
nails. This situation is as if I planned to use it to break rocks.


The bootstrap is more like a whole toolbox than just a single tool.  I think 
part of the confusion in this discussion is because you kept asking for a 
hammer and Frank and others kept looking at their toolbox full of hammers and 
asking you which one you wanted.  Yes you can break a rock with a hammer 
designed to drive nails, but why not use the hammer designed to break rocks 
when it is easily available.


The key point is that I don't really care about the bias or
variance of the mean in the model. These things are useful
for statisticians; regular people (like me, also a chemist)
do not understand them and have no use for them (well, now I
somewhat understand). My goal is very
practical: I need an equation that can predict patient's
outcome, based on some data, with maximum reliability and accuracy.


But to get the model with maximum reliability and accuracy you need to account 
for bias and minimize variability.  You may not care what those numbers are 
directly, but you do care indirectly about their influence on your final model. 
 Another instance where both sides were talking past each other.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111




Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-24 Thread Bert Gunter
To quote (or as nearly so as I can) Einstein's famous remark:

"Make everything as simple as possible ... but no simpler."

Moreover, "as possible" here means maintaining fidelity to scientific
validity, not "simple enough for me to understand". So I don't think a
physicist can explain relativistic cosmology to me (or an organic chemist,
how to synthesize ketones) so that I can understand it without compromising
scientific validity. The onus is then on me to either learn what I need to
know to understand it, or accept the authoritative view of the physicist (or
chemist). I cannot claim ignorance and reject the cosmology because it is
beyond me. That's the flat earth philosophy of science, and it is a
terrible obstacle to scientific progress and human enlightenment, in
general.

Cheers,
Bert Gunter


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Michal Figurski
Sent: Thursday, July 24, 2008 8:03 AM
Cc: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to
get them?

Greg and all,

Just another thought on bias and variability. As I tried to explain, I 
perceive this problem as a very practical problem.

The equation, that is the goal of this work, is supposed to serve the 
clinicians to estimate a pharmacokinetic parameter. It therefore must be 
simple and also presented in simple language, so that an average 
spreadsheet user can make use of it.

Therefore, in the end, isn't the *predictive performance* an ultimate 
measure of it all? Doesn't it account for bias and all the other stuff? 
It does tell you in how many cases you may expect to have the predicted 
value within 15% of the true value.
I apologize for my naive questions again, but aren't then the 
calculations of bias and variance, etc, just a waste of time, while you 
have it all summarized in the predictive performance?

--
Michal J. Figurski

Greg Snow wrote:
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
 Sent: Wednesday, July 23, 2008 10:22 AM
 To: r-help@r-project.org
 Subject: Re: [R] Coefficients of Logistic Regression from
 bootstrap - how to get them?

 Thank you all for your words of wisdom.

 I start getting into what you mean by bootstrap. Not
 surprisingly, it seems to be something else than I do. The
 bootstrap is a tool, and I would rather compare it to a
 hammer than to a gun. People say that hammer is for driving
 nails. This situation is as if I planned to use it to break rocks.
 
 The bootstrap is more like a whole toolbox than just a single tool.  I
think part of the confusion in this discussion is because you kept asking
for a hammer and Frank and others kept looking at their toolbox full of
hammers and asking you which one you wanted.  Yes you can break a rock with
a hammer designed to drive nails, but why not use the hammer designed to
break rocks when it is easily available.
 
 The key point is that I don't really care about the bias or
 variance of the mean in the model. These things are useful
 for statisticians; regular people (like me, also a chemist)
 do not understand them and have no use for them (well, now I
 somewhat understand). My goal is very
 practical: I need an equation that can predict patient's
 outcome, based on some data, with maximum reliability and accuracy.
 
 But to get the model with maximum reliability and accuracy you need to
account for bias and minimize variability.  You may not care what those
numbers are directly, but you do care indirectly about their influence on
your final model.  Another instance where both sides were talking past each
other.
 
 --
 Gregory (Greg) L. Snow Ph.D.
 Statistical Data Center
 Intermountain Healthcare
 [EMAIL PROTECTED]
 (801) 408-8111



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-24 Thread Michal Figurski
What are the arguments against fidelity of this concept to scientific 
validity?


The concept of predictive performance was devised by one of you, 
biostatisticians - not me! I accept the authoritative view of the person 
that did it, especially because I do understand it.


When I think of it, excuse my ignorance, it looks to me that this 
measure summarizes effects of bias, variance, etc, and all the 
analytical and other errors. Please correct me if I am wrong, but spare 
me your sarcasm.


--
Michal J. Figurski

Bert Gunter wrote:

To quote (or as nearly so as I can) Einstein's famous remark:

"Make everything as simple as possible ... but no simpler."


Moreover, "as possible" here means maintaining fidelity to scientific
validity, not "simple enough for me to understand". So I don't think a
physicist can explain relativistic cosmology to me (or an organic chemist,
how to synthesize ketones) so that I can understand it without compromising
scientific validity. The onus is then on me to either learn what I need to
know to understand it, or accept the authoritative view of the physicist (or
chemist). I cannot claim ignorance and reject the cosmology because it is
beyond me. That's the flat earth philosophy of science, and it is a
terrible obstacle to scientific progress and human enlightenment, in
general.

Cheers,
Bert Gunter


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Michal Figurski
Sent: Thursday, July 24, 2008 8:03 AM
Cc: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to
get them?

Greg and all,

Just another thought on bias and variability. As I tried to explain, I 
perceive this problem as a very practical problem.


The equation, that is the goal of this work, is supposed to serve the 
clinicians to estimate a pharmacokinetic parameter. It therefore must be 
simple and also presented in simple language, so that an average 
spreadsheet user can make use of it.


Therefore, in the end, isn't the *predictive performance* an ultimate 
measure of it all? Doesn't it account for bias and all the other stuff? 
It does tell you in how many cases you may expect to have the predicted 
value within 15% of the true value.
I apologize for my naive questions again, but aren't then the 
calculations of bias and variance, etc, just a waste of time, while you 
have it all summarized in the predictive performance?


--
Michal J. Figurski

Greg Snow wrote:

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
Sent: Wednesday, July 23, 2008 10:22 AM
To: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from
bootstrap - how to get them?

Thank you all for your words of wisdom.

I start getting into what you mean by bootstrap. Not
surprisingly, it seems to be something else than I do. The
bootstrap is a tool, and I would rather compare it to a
hammer than to a gun. People say that hammer is for driving
nails. This situation is as if I planned to use it to break rocks.

The bootstrap is more like a whole toolbox than just a single tool.  I
think part of the confusion in this discussion is because you kept asking
for a hammer and Frank and others kept looking at their toolbox full of
hammers and asking you which one you wanted.  Yes you can break a rock with
a hammer designed to drive nails, but why not use the hammer designed to
break rocks when it is easily available.

The key point is that I don't really care about the bias or
variance of the mean in the model. These things are useful
for statisticians; regular people (like me, also a chemist)
do not understand them and have no use for them (well, now I
somewhat understand). My goal is very
practical: I need an equation that can predict patient's
outcome, based on some data, with maximum reliability and accuracy.

But to get the model with maximum reliability and accuracy you need to
account for bias and minimize variability.  You may not care what those
numbers are directly, but you do care indirectly about their influence on
your final model.  Another instance where both sides were talking past each
other.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111




Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-24 Thread John Kane



--- On Thu, 7/24/08, Michal Figurski [EMAIL PROTECTED] wrote:

 From: Michal Figurski [EMAIL PROTECTED]
 Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to 
 get them?
 To: 
 Cc: r-help@r-project.org r-help@r-project.org
 Received: Thursday, July 24, 2008, 11:02 AM
 Greg and all,
 
 Just another thought on bias and variability. As I tried to
 explain, I 
 perceive this problem as a very practical problem.
 
 The equation, that is the goal of this work, is supposed to
 serve the 
 clinicians to estimate a pharmacokinetic parameter. It
 therefore must be 
 simple and also presented in simple language, so that an
 average 
 spreadsheet user can make use of it.
 
 Therefore, in the end, isn't the *predictive
 performance* an ultimate 
 measure of it all? Doesn't it account for bias and all
 the other stuff? 

I think you need to look at Greg Snow's comment again. I am not a statistician 
but 

Greg says: 

But to get the model with maximum reliability and accuracy you need to account 
for bias and minimize variability.

As I read it, your predictive validity is partly a function of how well you 
account for bias and minimize variability.  

Prediction may be the desired outcome but you don't get the best possible 
outcome unless you manage to account for these issues.  


 It does tell you in how many cases you may expect to have
 the predicted 
 value within 15% of the true value.
 I apologize for my naive questions again, but aren't
 then the 
 calculations of bias and variance, etc, just a waste of
 time, while you 
 have it all summarized in the predictive performance?
 
 --
 Michal J. Figurski
 
 Greg Snow wrote:
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of
 Michal Figurski
  Sent: Wednesday, July 23, 2008 10:22 AM
  To: r-help@r-project.org
  Subject: Re: [R] Coefficients of Logistic
 Regression from
  bootstrap - how to get them?
 
  Thank you all for your words of wisdom.
 
  I start getting into what you mean by bootstrap.
 Not
  surprisingly, it seems to be something else than I
 do. The
  bootstrap is a tool, and I would rather compare it
 to a
  hammer than to a gun. People say that hammer is
 for driving
  nails. This situation is as if I planned to use it
 to break rocks.
  
  The bootstrap is more like a whole toolbox than just a
 single tool.  I think part of the confusion in this
 discussion is because you kept asking for a hammer and
 Frank and others kept looking at their toolbox full of
 hammers and asking you which one you wanted.  Yes you can
 break a rock with a hammer designed to drive nails, but why
 not use the hammer designed to break rocks when it is easily
 available.
  
  The key point is that I don't really care
 about the bias or
  variance of the mean in the model. These things
 are useful
  for statisticians; regular people (like me, also a
 chemist)
  do not understand them and have no use for them
 (well, now I
  somewhat understand). My goal is very
  practical: I need an equation that can predict
 patient's
  outcome, based on some data, with maximum
 reliability and accuracy.
  
  But to get the model with maximum reliability and
 accuracy you need to account for bias and minimize
 variability.  You may not care what those numbers are
 directly, but you do care indirectly about their influence
 on your final model.  Another instance where both sides
 were talking past each other.
  
  --
  Gregory (Greg) L. Snow Ph.D.
  Statistical Data Center
  Intermountain Healthcare
  [EMAIL PROTECTED]
  (801) 408-8111
 


  __
[[elided Yahoo spam]]



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Mark Difford

Hi All,

It really comes down to a question of attitude: you either want to learn
something fundamental or core and so bootstrap yourself to a better place
(at least away from where you are), or you don't. As Marc said, Michal seems
to have erected a wall around his thinking.

I don't think it's fair to take pot shots at Frank for not wanting to
promote or further something he doesn't believe in. He's a regular
contributor to the list, who gives sound advice. He's also one of the few
experts on the list who is prepared to give statistical advice.

Regards, Mark.


Rolf Turner-3 wrote:
 
 
 On 23/07/2008, at 1:17 PM, Frank E Harrell Jr wrote:
 
 Michal Figurski wrote:
 Hmm...
 It sounds like ideology to me. I was asking for technical help. I  
 know what I want to do, just don't know how to do it in R. I'll go  
 back to SAS then. Thank you.
 -- 
 Michal J. Figurski

 You don't understand any of the theory and you are using techniques  
 you don't understand and have provided no motivation for.  And you  
 are the one who is frustrated with others.  Wow.
 
   Come off it guys.  It is indeed very frustrating when one asks ``How  
 can I do X''
   and gets told ``Don't do X, do Y.''  It may well be the case that  
 doing X is
   wrong-headed, misleading, and may cause the bridge to fall down, or  
 the world to
   come to an end.  Fair enough to point this out --- but then why not  
 just tell
   the poor beggar, who asked, how to do X?
 
   The only circumstance in which *not* telling the poor beggar how to  
 do X is
   justified is that in which it takes considerable *work* to figure  
 out how to
   do X.  In this case it is perfectly reasonable to say ``I think  
 doing X is
   stupid so I am not going to waste my time figuring out for you how  
 to do it.''
 
   I don't know enough about the bootstrapping software (don't know  
 *anything*
   about it actually) to know whether the foregoing circumstance  
 applies here.
   But I suspect it doesn't.  And I suspect that you (Frank) could tell  
 Michal in
   a few lines the answer to the question that he *asked* (as opposed,  
 possibly,
   to the question that he should have asked).
 
   If it were my problem I'd just write my own bootstrapping function  
 to apply
   to the problem in hand.  It can't be that hard ... just a for loop  
 and a
   call to sample(...,replace=TRUE).
 
   If you can write macros in SAS then .
 
   cheers,
 
   Rolf
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Coefficients-of-Logistic-Regression-from-bootstrap---how-to-get-them--tp18570684p18605881.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Michal Figurski

Thank you Gustaf,

I apologize for not including an example data in my first email. 
Nevertheless, your code worked for me excellently - I only added 55 as 
the size of sample.


I must admit this code looks so much simpler, compared to SAS. I am 
beginning to love R, despite some disrespectful experts in this forum.


--
Michal J. Figurski

Gustaf Rydevik wrote:


figurski.df <- data.frame(name=1:109, num1=rnorm(109), num2=rnorm(109),
                          num3=rnorm(109),
                          outcome=sample(c(1,0), 109, replace=TRUE))
library(Design)
lrm(outcome ~ num1 + num2 + num3, data=figurski.df)$coef
coef <- list()
for (i in 1:100){
  tempData <- figurski.df[sample(1:109, replace=TRUE), ]
  coef[[i]] <- lrm(outcome ~ num1 + num2 + num3, data=tempData)$coef
}
coef.df <- data.frame(do.call(rbind, coef))
median(coef.df$num1)
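A self-contained variant of Gustaf's sketch, on hypothetical data of my own, using base R's glm() in place of Design's lrm() (for a plain logistic fit the coefficients agree), and reporting the median of every coefficient at once - the "median coefficient model" discussed in this thread:

```r
## Bootstrap a logistic regression and take the column-wise median of
## the coefficients across resamples.
set.seed(7)
d <- data.frame(num1 = rnorm(109), num2 = rnorm(109), num3 = rnorm(109),
                outcome = rbinom(109, 1, 0.5))      # hypothetical data
coefs <- t(replicate(100, {
  boot <- d[sample(nrow(d), replace = TRUE), ]
  coef(glm(outcome ~ num1 + num2 + num3, family = binomial, data = boot))
}))
apply(coefs, 2, median)    # median intercept and slopes
```

Keeping the full `coefs` matrix also lets you inspect the bootstrap distribution of each coefficient, which is the standard inferential use of these resamples.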





Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Johannes Huesing
It seems you have accidentally hit a surgeons' mailing list where all
you wanted was some advice how to use this scalpel on your body.
Sorry if we can't be of any help without intimidating you with
unrelated and pompous terms -- like coagulation.

Michal J. Figurski [EMAIL PROTECTED] [Wed, Jul 23, 2008 at 04:54:36AM CEST]:
 Dear all,
 
 Since you guys are frank, let me be frank as well. I did not ask anyone to
 impose on me their point of view on bootstrap. It's my impression that this is
 what you guys are trying to do - that's sad. Some of your emails in this
 discussion are worth less than junk mail - particularly the ones from Mr 
 Harold
 Doran. It's even more sad that you use junior members of this forum to make 
 fun
 and intimidate.
 
 Apparently, even with all your expertise and education in this area, many of 
 you
 - experts - do not understand what I am talking about. You seem to be so much
 affixed to your expertise, that you can't see anything beyond it.
 

-- 
Johannes Hüsing
mailto:[EMAIL PROTECTED]
http://derwisch.wikidot.com

"There is something fascinating about science. One gets such wholesale
returns of conjecture from such a trifling investment of fact."
(Mark Twain, Life on the Mississippi)



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Gustaf Rydevik
On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
[EMAIL PROTECTED] wrote:
 Gustaf,

 I am sorry, but I don't get the point. Let's just focus on predictive
 performance from the cited passage, that is the number of values predicted
 within 15% of the original value.
 So, the predictive performance from the model fit on the entire dataset was 56%
 of profiles, while from the bootstrapped model it was 82% of profiles. Well - I
 see a stunning purpose in the bootstrap step here: it turns a useless
 equation into a clinically applicable model!

 Honestly, I also can't see how this can be better than fitting on the entire
 dataset, but here you have a proof that it is.

 I think that another argument supporting this approach is model validation.
 If you fit the model on the entire dataset, you have no data left to validate
 its predictions.

 On the other hand, I agree with you that the passage in methods section
 looks awkward.

 In my work on a similar problem, that is going to appear in August in Ther
 Drug Monit, I used medians since beginning and all the comparisons were done
 based on models with median coefficients. I think this is what the authors
 of that paper did, though they might just have had a problem with describing
 it correctly, and unfortunately it passed through the review process unchanged.




Hi,

I believe that you misunderstand the passage. Do you know what
multiple stepwise regression is?

Since they used SPSS, I copied from
http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm

Stepwise selection is a combination of forward and backward procedures.
Step 1

The first predictor variable is selected in the same way as in forward
selection. If the probability associated with the test of significance
is less than or equal to the default .05, the predictor variable with
the largest correlation with the criterion variable enters the
equation first.


Step 2

The second variable is selected based on the highest partial
correlation. If it can pass the entry requirement (PIN=.05), it also
enters the equation.

Step 3

From this point, stepwise selection differs from forward selection:
the variables already in the equation are examined for removal
according to the removal criterion (POUT=.10) as in backward
elimination.

Step 4

Variables not in the equation are examined for entry. Variable
selection ends when no more variables meet entry and removal criteria.
---


It is the outcome of this *entire process*, steps 1-4, that they compare
with the outcome of their *entire bootstrap/crossvalidation/selection
process*, steps 1-4 in the methods section, and find that their approach
gives a better result.
What you are doing is only step 4 of the article's method
section: estimating the parameters of a model *when you already know
which variables to include*. It is the way this step is conducted that
I am sceptical about.
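For readers unfamiliar with the procedure, here is a rough base-R analogue of the SPSS stepwise selection quoted above, on hypothetical data of my own. Note that step() selects on AIC rather than the PIN/POUT p-value entry/removal criteria, so this is a sketch of the idea, not an exact reproduction:

```r
## Stepwise ("both" directions) variable selection on hypothetical data.
set.seed(1)
d <- data.frame(matrix(rnorm(50 * 5), ncol = 5))
names(d) <- c("y", "x1", "x2", "x3", "x4")
fit <- step(lm(y ~ 1, data = d),            # start from the null model
            scope = y ~ x1 + x2 + x3 + x4,  # upper model for entry
            direction = "both",             # forward entry + backward removal
            trace = FALSE)
formula(fit)    # whichever subset of x1..x4 survived selection
```

The point Gustaf makes stands either way: the paper's comparison is against the result of the whole selection process, not just the final coefficient estimation.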

Regards,

Gustaf

-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Michal Figurski

Thank you all for your words of wisdom.

I start getting into what you mean by bootstrap. Not surprisingly, it 
seems to be something else than I do. The bootstrap is a tool, and I 
would rather compare it to a hammer than to a gun. People say that 
hammer is for driving nails. This situation is as if I planned to use it 
to break rocks.


The key point is that I don't really care about the bias or variance of 
the mean in the model. These things are useful for statisticians; 
regular people (like me, also a chemist) do not understand them and have 
no use for them (well, now I somewhat understand). My goal is very 
practical: I need an equation that can predict patient's outcome, based 
on some data, with maximum reliability and accuracy.


I have found from the mentioned paper (and from my own experience) that 
re-sampling and running the regression on re-sampled dataset multiple 
times does improve predictions. You have a proof of that in that paper, 
page 1502, and to me it is rather a stunning proof: compare 56% to 82% 
of correctly predicted values (correct means within 15% of original value).


I can understand that it's somewhat new for many of you, and some tried 
to discourage me from this approach (shooting my foot). This concept was 
devised by, I believe, Mr Michael Hale, a respectable biostatistician. 
It utilises the bootstrap concept of resampling, though, after the recent 
discussion, I think it should be called by another name.


In addition to better predictive performance, using this concept I also 
get a second dataset with each iteration, that can be used for 
validation of the model. In this approach the validation data are 
accumulated throughout the bootstrap, and then used in the end to 
calculate log residuals using equation with median coefficients. I am 
sure you can question that in many ways, but to me this is as good as 
you can get.


To be more practical, I will ask the authors of this paper if I can post 
their original dataset in this forum (I have it somewhere) - if you guys 
think it's interesting enough. Then anyone of you could use it, follow 
the procedure, and criticize, if they wish.


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

S Ellison wrote:

jeez, but you've kicked up a storm!

A penn'orth on the bootstrap; and since I'm a chemist, you can ignore it at
will.

The bootstrap starts with your data and the model you developed with
it. Resampling gives a fair idea of what the variance _around your
current estimate_ is. But it cannot tell you how biased you are or
improve your estimate, because there is no more information in your
data. 


Toy example. Let's say I get some results from some measurement
procedure, like this.

set.seed(408) #so we get the same random sample (!)

y <- rnorm(12,5) #OK, not a very convincing measurement, but

#Now let's add a bit of bias
y <- y+3

mean(y) #... is my (biased) estimate of the mean value.

#Now let's pretend I don't know the true answer OR the bias, which is
#what happens in the real world, and try bootstrapping. Let's get a
#rather generous 10000 resamples from my data;


m <- matrix(sample(y, length(y)*10000, replace=TRUE), ncol=length(y))
#This gives me a matrix with 10000 rows, each of which is a resample 
#of my 12 data. 


#And now we can calculate 10000 bootstrapped means in one shot:
bs.mean <- apply(m, 1, mean) #which applies 'mean' to each row.

#We hope the variance of these things is about 1/12, 'cos we got y from
#a normal distribution with var 1 and we had 12 of them.  Let's see...

var(bs.mean)

#which should resemble
1/12

#and does.. roughly. 
#And for interest, compare with what we get directly from the data;

var(y)/12
#which in this case was slightly further from the 'true' variance. It
#won't always be, though; that depends on the data.


#Anyway, the bootstrap variance looks about right. So ... on to bias

#Now, where would we expect the bootstrapped mean value to be? 
#At the true value, or where we started?

mean(bs.mean)

#Oh dear. It's still biased. And it looks very much like the mean of
y.
#It's clearly told us nothing about the true mean.

#Bottom line; All you have is your data. Bootstrapping uses your data.

#Therefore, bootstrapping can tell you no more than you can get from
your data.
#But it's still useful if you have some rather more complicated
#statistic derived from a non-linear fit, because it lets you get some
#idea of the variance.

#But not the bias.

This may be why some folk felt that your solution as worded (an
ever-present peril, wording) was not an answer to the right question.
The fitting procedure already gives you the 'best estimate' (where
'best' means max likelihood, this time), and bootstrapping really cannot
improve on that. It can only start at your current 'best' and move away
from it in a random direction.  That can't possibly improve the
estimated coefficients. 

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Greg Snow
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
 Sent: Wednesday, July 23, 2008 10:22 AM
 To: r-help@r-project.org
 Subject: Re: [R] Coefficients of Logistic Regression from
 bootstrap - how to get them?

 Thank you all for your words of wisdom.

 I start getting into what you mean by bootstrap. Not
 surprisingly, it seems to be something else than I do. The
 bootstrap is a tool, and I would rather compare it to a
 hammer than to a gun. People say that hammer is for driving
 nails. This situation is as if I planned to use it to break rocks.

The bootstrap is more like a whole toolbox than just a single tool.  I think 
part of the confusion in this discussion is because you kept asking for a 
hammer and Frank and others kept looking at their toolbox full of hammers and 
asking you which one you wanted.  Yes, you can break a rock with a hammer 
designed to drive nails, but why not use the hammer designed to break rocks 
when one is readily available?


 The key point is that I don't really care about the bias or
 variance of the mean in the model. These things are useful
 for statisticians; regular people (like me, also a chemist)
 do not understand them and have no use for them (well, now I
 somewhat understand). My goal is very
 practical: I need an equation that can predict patient's
 outcome, based on some data, with maximum reliability and accuracy.

But to get the model with maximum reliability and accuracy you need to account 
for bias and minimize variability.  You may not care what those numbers are 
directly, but you do care indirectly about their influence on your final model. 
Another instance where both sides were talking past each other.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Frank E Harrell Jr

Gustaf Rydevik wrote:

On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
[EMAIL PROTECTED] wrote:

Gustaf,

I am sorry, but I don't get the point. Let's just focus on predictive
performance from the cited passage, that is the number of values predicted
within 15% of the original value.
So, the predictive performance from the model fit on entire dataset was 56%
of profiles, while from bootstrapped model it was 82% of profiles. Well - I
see a stunning purpose in the bootstrap step here: it turns a useless
equation into a clinically applicable model!

Honestly, I also can't see how this can be better than fitting on entire
dataset, but here you have a proof that it is.

I think that another argument supporting this approach is model validation.
If you fit model on entire data, you have no data left to validate its
predictions.

On the other hand, I agree with you that the passage in methods section
looks awkward.

In my work on a similar problem, which is going to appear in August in Ther
Drug Monit, I used medians from the beginning, and all the comparisons were
done based on models with median coefficients. I think this is what the
authors of that paper did, though they might just have had a problem with
describing it correctly, and unfortunately it passed through the review
process unchanged.





Hi,

I believe that you misunderstand the passage. Do you know what
multiple stepwise regression is?

Since they used SPSS, I copied from
http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm

Stepwise selection is a combination of forward and backward procedures.
Step 1

The first predictor variable is selected in the same way as in forward
selection. If the probability associated with the test of significance
is less than or equal to the default .05, the predictor variable with
the largest correlation with the criterion variable enters the
equation first.


Step 2

The second variable is selected based on the highest partial
correlation. If it can pass the entry requirement (PIN=.05), it also
enters the equation.

Step 3


From this point, stepwise selection differs from forward selection:

the variables already in the equation are examined for removal
according to the removal criterion (POUT=.10) as in backward
elimination.

Step 4

Variables not in the equation are examined for entry. Variable
selection ends when no more variables meet entry and removal criteria.
---
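
For reference, a rough R analogue of the SPSS stepwise procedure quoted above, run on simulated data. Note that R's step() selects by AIC rather than the PIN/POUT p-value thresholds, so this is an analogue, not a replica; the variable names are made up for illustration.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 1 + 2 * d$x1 + rnorm(50)        # only x1 truly matters
full <- lm(y ~ x1 + x2 + x3, data = d)
## direction = "both" mixes forward entry and backward removal, roughly
## like steps 1-4 above: start from the empty model, add/drop terms.
sel <- step(lm(y ~ 1, data = d), scope = formula(full),
            direction = "both", trace = 0)
coef(sel)  # the variables that survived entry and removal
```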


It is the outcome of this *entire process*, steps 1-4, that they compare
with the outcome of their *entire bootstrap/crossvalidation/selection
process*, steps 1-4 in the methods section, and find that their approach
gives better results.
What you are doing is only step 4 in the article's method
section: estimating the parameters of a model *when you already know
which variables to include*. It is the way this step is conducted that
I am sceptical about.

Regards,

Gustaf



Perfectly stated Gustaf.  This is a great example of needing to truly 
understand a method to be able to use it in the right context.


After having read most of the paper by Pawinski et al now, there are 
other problems.


1. The paper nowhere uses bootstrapping.  It uses repeated 2-fold 
cross-validation, a procedure not usually recommended.


2. The resampling procedure used in the paper treated the 50 
pharmacokinetic profiles on 21 renal transplant patients as if these 
were from 50 patients.  The cluster bootstrap should have been used instead.
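
A cluster bootstrap along the lines Frank describes might look like this; a sketch only, with the data frame, id column, and statistic as placeholders:

```r
## Resample patients (clusters), not individual profiles, so that the
## within-patient correlation is preserved in each bootstrap sample.
cluster_boot <- function(dat, id, B = 1000, stat) {
  ids <- unique(dat[[id]])
  replicate(B, {
    pick <- sample(ids, length(ids), replace = TRUE)
    ## keep every profile belonging to each sampled patient
    res <- do.call(rbind, lapply(pick, function(p) dat[dat[[id]] == p, ]))
    stat(res)
  })
}
## hypothetical usage, assuming a 'profiles' data frame with a 'patient'
## column: cluster_boot(profiles, "patient", B = 500,
##                      stat = function(d) coef(lm(AUC ~ C0 + C1, data = d)))
```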


3. Figure 2 showed the fitted regression line to the predicted vs. 
observed AUCs.  It should have shown the line of identity instead.  In 
other words, the authors allowed a subtle recalibration to creep into 
the analysis (and inverted the x- and y-variables in the plots).  The 
fitted lines are far enough away from the line of identity as to show 
that the predicted values are not well calibrated.  The r^2 values 
claimed by the authors used the wrong formulas which allowed an 
automatic after-the-fact recalibration (new overall slope and intercept 
are estimated in the test dataset).  Hence the achieved r^2 are misleading.



--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-23 Thread Gad Abraham

Michal Figurski wrote:

Thank you all for your words of wisdom.

I start getting into what you mean by bootstrap. Not surprisingly, it 
seems to be something else than I do. The bootstrap is a tool, and I 
would rather compare it to a hammer than to a gun. People say that 
hammer is for driving nails. This situation is as if I planned to use it 
to break rocks.


The key point is that I don't really care about the bias or variance of 
the mean in the model. These things are useful for statisticians; 
regular people (like me, also a chemist) do not understand them and have 
no use for them (well, now I somewhat understand). My goal is very 
practical: I need an equation that can predict patient's outcome, based 
on some data, with maximum reliability and accuracy.


My two cents:

Bootstrapping (especially the optimism bootstrap, see Harrell 2001 
``Regression Modeling Strategies'') can be used to estimate how well a 
given model generalises. In other words, to estimate how much your model 
is overfitted to your data (more overfitting = less generalisable model).


This in itself is not useful for getting the coefficients of a good 
model (which is always done through MLE), but it can be used to compare 
different models. As Frank Harrell mentioned, you can do penalised 
regression, and find the best penalty through bootstrapping. This will 
possibly yield a model that is less overfitted and hence more reliable 
in terms of being valid for an unseen sample (from the same population).

Again, see Frank's book for more information about penalisation.
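
A minimal sketch of how the optimism bootstrap Gad mentions can be run with validate() from the rms package (the successor to Design); the data frame d and its columns x1..x3, y are placeholders:

```r
library(rms)
## x = TRUE, y = TRUE stores the design matrix and response so that
## validate() can refit the model on bootstrap resamples.
fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
## Refits on B resamples, evaluates each refit on the original data, and
## reports apparent vs. optimism-corrected indexes (Dxy, R2, slope, ...).
validate(fit, B = 200)
```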

--
Gad Abraham
Dept. CSSE and NICTA
The University of Melbourne
Parkville 3010, Victoria, Australia
email: [EMAIL PROTECTED]
web: http://www.csse.unimelb.edu.au/~gabraham



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Michal Figurski

Dear all,

I don't want to argue with anybody about words or about what bootstrap 
is suitable for - I know too little for that.


All I need is help to get the *equation coefficients* optimized by 
bootstrap - either by one of the functions or by simple median.


Please help,

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Frank E Harrell Jr wrote:

Michal Figurski wrote:

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set are just a 
small sample of a big population, then if I use my whole dataset to 
obtain max likelihood estimates, these estimates may be best for this 
dataset, but far from ideal for the whole population.


The bootstrap, being a resampling procedure from your sample, has the 
same issues about the population as MLEs.




I used bootstrap to virtually increase the size of my dataset, it 
should result in estimates more close to that from the population - 
isn't it the purpose of bootstrap?


No



When I use such median coefficients on another dataset (another sample 
from population), the predictions are better, than using max 
likelihood estimates. I have already tested that and it worked!


Then your testing procedure is probably not valid.



I am not a statistician and I don't feel what overfitting is, but it 
may be just another word for the same idea.


Nevertheless, I would still like to know how I can get the coefficients 
for the model that gives the nearly unbiased estimates. I greatly 
appreciate your help.


More info in my book Regression Modeling Strategies.

Frank



--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Frank E Harrell Jr wrote:

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using 
bootstrap. I was previously using SAS for this kind of tasks, but I 
am now switching to R.


My data frame consists of 5 columns and has 109 rows. Each row is a 
single record composed of the following values: Subject_name, 
numeric1, numeric2, numeric3 and outcome (yes or no). All three 
numerics are used to predict outcome using LR.


In SAS I have written a macro, that was splitting the dataset, 
running LR on one half of data and making predictions on second 
half. Then it was collecting the equation coefficients from each 
iteration of bootstrap. Later I was just taking medians of these 
coefficients from all iterations, and used them as an optimal model 
- it really worked well!
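
The split-half/median procedure described above could be sketched in R like this (a sketch of what is described, not an endorsement; the column names numeric1..numeric3 and outcome come from the message, and outcome is assumed to be coded as 0/1 or a factor):

```r
## Repeatedly fit the logistic model on a random half of the data,
## collect the coefficients, and take the median of each one.
split_half_medians <- function(dat, B = 1000) {
  coefs <- replicate(B, {
    idx <- sample(nrow(dat), nrow(dat) %/% 2)   # random half of the rows
    fit <- glm(outcome ~ numeric1 + numeric2 + numeric3,
               family = binomial, data = dat[idx, ])
    coef(fit)
  })
  apply(coefs, 1, median)  # median of each coefficient across splits
}
```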


Why not use maximum likelihood estimation, i.e., the coefficients 
from the original fit.  How does the bootstrap improve on that?




Now I want to do the same in R. I tried to use the 'validate' or 
'calibrate' functions from package Design, and I also experimented 
with function 'sm.binomial.bootstrap' from package sm. I tried 
also the function 'boot' from package boot, though without success 
- in my case it randomly selected _columns_ from my data frame, 
while I wanted it to select _rows_.


validate and calibrate in Design do resampling on the rows

Resampling is mainly used to get a nearly unbiased estimate of the 
model performance, i.e., to correct for overfitting.


Frank Harrell



Though the main point here is the optimized LR equation. I would 
appreciate any help on how to extract the LR equation coefficients 
from any of these bootstrap functions, in the same form as given by 
'glm' or 'lrm'.


Many thanks in advance!













Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Doran, Harold
I think the answer has been given to you. If you want to continue to
ignore that advice and use bootstrap for point estimates rather than the
properties of those estimates (which is what bootstrap is for) then you
are on your own. 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
 Sent: Tuesday, July 22, 2008 9:52 AM
 To: r-help@r-project.org
 Subject: Re: [R] Coefficients of Logistic Regression from 
 bootstrap - how to get them?
 
 Dear all,
 
 I don't want to argue with anybody about words or about what 
 bootstrap is suitable for - I know too little for that.
 
 All I need is help to get the *equation coefficients* 
 optimized by bootstrap - either by one of the functions or by 
 simple median.
 
 Please help,
 
 --
 Michal J. Figurski
 HUP, Pathology & Laboratory Medicine
 Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce 
 St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413
 
 
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Michal Figurski

Hmm...

It sounds like ideology to me. I was asking for technical help. I know 
what I want to do, just don't know how to do it in R. I'll go back to 
SAS then. Thank you.


--
Michal J. Figurski

Doran, Harold wrote:

I think the answer has been given to you. If you want to continue to
ignore that advice and use bootstrap for point estimates rather than the
properties of those estimates (which is what bootstrap is for) then you
are on your own. 







Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Doran, Harold
Probably a good idea for you. The R help list is useful for both
programming AND statistical advice for those who want it.

 

 -Original Message-
 From: Michal Figurski [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, July 22, 2008 10:44 AM
 To: Doran, Harold; r-help@r-project.org
 Subject: Re: [R] Coefficients of Logistic Regression from 
 bootstrap - how to get them?
 
 Hmm...
 
 It sounds like ideology to me. I was asking for technical 
 help. I know what I want to do, just don't know how to do it 
 in R. I'll go back to SAS then. Thank you.
 
 --
 Michal J. Figurski
 

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread N. Lapidus
Hi Michal,
This paper by John Fox may help you clarify what you are looking for and
perform your analyses:
http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf
Nael

On Tue, Jul 22, 2008 at 3:51 PM, Michal Figurski 
[EMAIL PROTECTED] wrote:

 Dear all,

 I don't want to argue with anybody about words or about what bootstrap is
 suitable for - I know too little for that.

 All I need is help to get the *equation coefficients* optimized by
 bootstrap - either by one of the functions or by simple median.

 Please help,

 --
 Michal J. Figurski
 HUP, Pathology & Laboratory Medicine
 Xenobiotics Toxicokinetics Research Laboratory
 3400 Spruce St. 7 Maloney
 Philadelphia, PA 19104
 tel. (215) 662-3413

 Frank E Harrell Jr wrote:

 Michal Figurski wrote:

 Frank,

 How does bootstrap improve on that?

 I don't know, but I have an idea. Since the data in my set are just a
 small sample of a big population, then if I use my whole dataset to obtain
 max likelihood estimates, these estimates may be best for this dataset, but
 far from ideal for the whole population.


 The bootstrap, being a resampling procedure from your sample, has the same
 issues about the population as MLEs.


 I used bootstrap to virtually increase the size of my dataset, it should
 result in estimates more close to that from the population - isn't it the
 purpose of bootstrap?


 No


 When I use such median coefficients on another dataset (another sample
 from population), the predictions are better, than using max likelihood
 estimates. I have already tested that and it worked!


 Then your testing procedure is probably not valid.


 I am not a statistician and I don't feel what overfitting is, but it
 may be just another word for the same idea.

 Nevertheless, I would still like to know how can I get the coeffcients
 for the model that gives the nearly unbiased estimates. I greatly
 appreciate your help.


 More info in my book Regression Modeling Strategies.

 Frank


 --
 Michal J. Figurski
 HUP, Pathology  Laboratory Medicine
 Xenobiotics Toxicokinetics Research Laboratory
 3400 Spruce St. 7 Maloney
 Philadelphia, PA 19104
 tel. (215) 662-3413

 Frank E Harrell Jr wrote:

 Michal Figurski wrote:

 Hello all,

 I am trying to optimize my logistic regression model by using
 bootstrap. I was previously using SAS for this kind of tasks, but I am now
 switching to R.

 My data frame consists of 5 columns and has 109 rows. Each row is a
 single record composed of the following values: Subject_name, numeric1,
 numeric2, numeric3 and outcome (yes or no). All three numerics are used to
 predict outcome using LR.

 In SAS I have written a macro, that was splitting the dataset, running
 LR on one half of data and making predictions on second half. Then it was
 collecting the equation coefficients from each iteration of bootstrap. 
 Later
 I was just taking medians of these coefficients from all iterations, and
 used them as an optimal model - it really worked well!


 Why not use maximum likelihood estimation, i.e., the coefficients from
 the original fit.  How does the bootstrap improve on that?


 Now I want to do the same in R. I tried to use the 'validate' or
 'calibrate' functions from package Design, and I also experimented with
 function 'sm.binomial.bootstrap' from package sm. I tried also the
 function 'boot' from package boot, though without success - in my case 
 it
 randomly selected _columns_ from my data frame, while I wanted it to 
 select
 _rows_.


 validate and calibrate in Design do resampling on the rows

 Resampling is mainly used to get a nearly unbiased estimate of the model
 performance, i.e., to correct for overfitting.

 Frank Harrell


 Though the main point here is the optimized LR equation. I would
 appreciate any help on how to extract the LR equation coefficients from 
 any
 of these bootstrap functions, in the same form as given by 'glm' or 'lrm'.

 Many thanks in advance!







 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Bert Gunter
The bootstrap **can** be used for bias correction. However, it may not be
such a good thing to do. I quote from Efron and Tibshirani's AN INTRODUCTION
TO THE BOOTSTRAP (p.138):

... bias estimation is usually interesting and worthwhile, but the exact
use of a bias estimate is often problematic. Biases are harder to estimate
than standard errors... The straightforward bias correction can be
dangerous to use in practice, due to high variability in bias.  Correcting
the bias may cause a large increase in the standard error, which in turn
results in a larger RMS error...
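The bias estimate the quote refers to is straightforward to compute in R with the boot package; the following is a minimal sketch using made-up illustrative data, not anything from this thread:

```r
## Sketch: bootstrap bias estimation in the sense of Efron & Tibshirani.
## The data here are invented purely for illustration.
library(boot)

set.seed(1)
x <- rexp(30)                      # small, skewed sample

# the statistic function receives the data and a vector of row indices
med <- function(d, i) median(d[i])

b <- boot(x, med, R = 2000)
bias <- mean(b$t) - b$t0           # bootstrap estimate of the bias
c(estimate = b$t0, bias = bias, corrected = b$t0 - bias)
```

As the quote warns, subtracting this bias estimate trades bias for variance and can increase the RMS error of the corrected estimate.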

Proceed at your own risk...

Cheers,
Bert Gunter
Genentech

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Michal Figurski
Sent: Tuesday, July 22, 2008 7:44 AM
To: Doran, Harold; r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - how to
get them?

Hmm...

It sounds like ideology to me. I was asking for technical help. I know 
what I want to do, just don't know how to do it in R. I'll go back to 
SAS then. Thank you.

--
Michal J. Figurski

Doran, Harold wrote:
 I think the answer has been given to you. If you want to continue to
 ignore that advice and use bootstrap for point estimates rather than the
 properties of those estimates (which is what bootstrap is for) then you
 are on your own. 
 
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
 Sent: Tuesday, July 22, 2008 9:52 AM
 To: r-help@r-project.org
 Subject: Re: [R] Coefficients of Logistic Regression from 
 bootstrap - how to get them?

 Dear all,

 I don't want to argue with anybody about words or about what 
 bootstrap is suitable for - I know too little for that.

 All I need is help to get the *equation coefficients* 
 optimized by bootstrap - either by one of the functions or by 
 simple median.

 Please help,

 --
 Michal J. Figurski
 HUP, Pathology & Laboratory Medicine
 Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce 
 St. 7 Maloney Philadelphia, PA 19104 tel. (215) 662-3413

 Frank E Harrell Jr wrote:
 Michal Figurski wrote:
 Frank,

 How does bootstrap improve on that?

 I don't know, but I have an idea. Since the data in my set 
 are just a 
 small sample of a big population, then if I use my whole 
 dataset to 
 obtain max likelihood estimates, these estimates may be 
 best for this 
 dataset, but far from ideal for the whole population.
 The bootstrap, being a resampling procedure from your 
 sample, has the 
 same issues about the population as MLEs.

 I used bootstrap to virtually increase the size of my dataset, it 
 should result in estimates more close to that from the 
 population - 
 isn't it the purpose of bootstrap?
 No

 When I use such median coefficients on another dataset (another 
 sample from population), the predictions are better, than 
 using max 
 likelihood estimates. I have already tested that and it worked!
 Then your testing procedure is probably not valid.

 I am not a statistician and I don't feel what 
 overfitting is, but 
 it may be just another word for the same idea.

  Nevertheless, I would still like to know how I can get the 
  coefficients for the model that gives the nearly unbiased estimates. 
  I greatly appreciate your help.
 More info in my book Regression Modeling Strategies.

 Frank

 --
 Michal J. Figurski
  HUP, Pathology & Laboratory Medicine
 Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
 Maloney Philadelphia, PA 19104 tel. (215) 662-3413

 Frank E Harrell Jr wrote:
 Michal Figurski wrote:
 Hello all,

 I am trying to optimize my logistic regression model by using 
 bootstrap. I was previously using SAS for this kind of 
 tasks, but I 
 am now switching to R.

 My data frame consists of 5 columns and has 109 rows. 
 Each row is a 
 single record composed of the following values: Subject_name, 
 numeric1, numeric2, numeric3 and outcome (yes or no). All three 
 numerics are used to predict outcome using LR.

 In SAS I have written a macro, that was splitting the dataset, 
 running LR on one half of data and making predictions on second 
 half. Then it was collecting the equation coefficients from each 
 iteration of bootstrap. Later I was just taking medians of these 
 coefficients from all iterations, and used them as an 
 optimal model
 - it really worked well!
 Why not use maximum likelihood estimation, i.e., the coefficients 
 from the original fit.  How does the bootstrap improve on that?

 Now I want to do the same in R. I tried to use the 'validate' or 
 'calibrate' functions from package Design, and I also 
 experimented with function 'sm.binomial.bootstrap' from package 
 sm. I tried also the function 'boot' from package 
 boot, though 
 without success
 - in my case it randomly selected _columns_ from my data frame, 
 while I wanted it to select _rows_.
 validate and calibrate in Design do resampling on the rows

 Resampling is mainly used to get

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Marc Schwartz

Michal,

With all due respect, you have openly acknowledged that you don't know 
enough about the subject at hand.


If that is the case, on what basis are you in a position to challenge 
the collective wisdom of those professionals who have voluntarily 
offered *expert* level statistical advice to you?


You have erected a wall around your thinking.

You may choose to use R or any other software application to 
Git-R-Done. But that does not make it correct.


There are other methods to consider that could be used during the model 
building process itself, rather than on a post-hoc basis and I would 
specifically refer you to Frank's book, Regression Modeling Strategies:


  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

Marc Schwartz

on 07/22/2008 09:43 AM Michal Figurski wrote:

Hmm...

It sounds like ideology to me. I was asking for technical help. I know 
what I want to do, just don't know how to do it in R. I'll go back to 
SAS then. Thank you.


--
Michal J. Figurski

Doran, Harold wrote:

I think the answer has been given to you. If you want to continue to
ignore that advice and use bootstrap for point estimates rather than the
properties of those estimates (which is what bootstrap is for) then you
are on your own.

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski

Sent: Tuesday, July 22, 2008 9:52 AM
To: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - 
how to get them?


Dear all,

I don't want to argue with anybody about words or about what 
bootstrap is suitable for - I know too little for that.


All I need is help to get the *equation coefficients* optimized by 
bootstrap - either by one of the functions or by simple median.


Please help,

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
Maloney Philadelphia, PA 19104 tel. (215) 662-3413


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set 

are just a
small sample of a big population, then if I use my whole 

dataset to
obtain max likelihood estimates, these estimates may be 

best for this

dataset, but far from ideal for the whole population.
The bootstrap, being a resampling procedure from your 

sample, has the

same issues about the population as MLEs.

I used bootstrap to virtually increase the size of my dataset, it 
should result in estimates more close to that from the 

population -

isn't it the purpose of bootstrap?

No

When I use such median coefficients on another dataset (another 
sample from population), the predictions are better, than 

using max

likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

I am not a statistician and I don't feel what 

overfitting is, but

it may be just another word for the same idea.

Nevertheless, I would still like to know how can I get the 
coeffcients for the model that gives the nearly unbiased 

estimates.

I greatly appreciate your help.

More info in my book Regression Modeling Strategies.

Frank


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
Maloney Philadelphia, PA 19104 tel. (215) 662-3413


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using 
bootstrap. I was previously using SAS for this kind of 

tasks, but I

am now switching to R.

My data frame consists of 5 columns and has 109 rows. 

Each row is a
single record composed of the following values: Subject_name, 
numeric1, numeric2, numeric3 and outcome (yes or no). All three 
numerics are used to predict outcome using LR.


In SAS I have written a macro, that was splitting the dataset, 
running LR on one half of data and making predictions on second 
half. Then it was collecting the equation coefficients from each 
iteration of bootstrap. Later I was just taking medians of these 
coefficients from all iterations, and used them as an 

optimal model

- it really worked well!
Why not use maximum likelihood estimation, i.e., the coefficients 
from the original fit.  How does the bootstrap improve on that?


Now I want to do the same in R. I tried to use the 'validate' or 
'calibrate' functions from package Design, and I also 
experimented with function 'sm.binomial.bootstrap' from package 
sm. I tried also the function 'boot' from package 

boot, though

without success
- in my case it randomly selected _columns_ from my data frame, 
while I wanted it to select _rows_.

validate and calibrate in Design do resampling on the rows

Resampling is mainly used to get a nearly unbiased 

estimate of the

model performance, i.e., to correct for overfitting.

Frank Harrell

Though the main point here is the optimized LR equation. I would

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Doran, Harold
 install.packages('fortunes')
 library(fortunes)
 fortune(28) 


 -Original Message-
 From: Marc Schwartz [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, July 22, 2008 1:29 PM
 To: Michal Figurski
 Cc: Doran, Harold; r-help@r-project.org; Frank E Harrell Jr; 
 Bert Gunter
 Subject: Re: [R] Coefficients of Logistic Regression from 
 bootstrap - how to get them?
 
 Michal,
 
 With all due respect, you have openly acknowledged that you 
 don't know enough about the subject at hand.
 
 If that is the case, on what basis are you in a position to 
 challenge the collective wisdom of those professionals who 
 have voluntarily offered *expert* level statistical advice to you?
 
 You have erected a wall around your thinking.
 
 You may choose to use R or any other software application to 
 Git-R-Done. But that does not make it correct.
 
 There are other methods to consider that could be used during 
 the model building process itself, rather than on a post-hoc 
 basis and I would specifically refer you to Frank's book, 
 Regression Modeling Strategies:
 
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
 
 Marc Schwartz
 
 on 07/22/2008 09:43 AM Michal Figurski wrote:
  Hmm...
  
  It sounds like ideology to me. I was asking for technical 
 help. I know 
  what I want to do, just don't know how to do it in R. I'll 
 go back to 
  SAS then. Thank you.
  
  --
  Michal J. Figurski
  
  Doran, Harold wrote:
  I think the answer has been given to you. If you want to 
 continue to 
  ignore that advice and use bootstrap for point estimates 
 rather than 
  the properties of those estimates (which is what bootstrap is for) 
  then you are on your own.
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski
  Sent: Tuesday, July 22, 2008 9:52 AM
  To: r-help@r-project.org
  Subject: Re: [R] Coefficients of Logistic Regression from 
 bootstrap 
  - how to get them?
 
  Dear all,
 
  I don't want to argue with anybody about words or about what 
  bootstrap is suitable for - I know too little for that.
 
  All I need is help to get the *equation coefficients* 
 optimized by 
  bootstrap - either by one of the functions or by simple median.
 
  Please help,
 
  --
  Michal J. Figurski
  HUP, Pathology & Laboratory Medicine Xenobiotics Toxicokinetics 
  Research Laboratory 3400 Spruce St. 7 Maloney 
 Philadelphia, PA 19104 
  tel. (215) 662-3413
 
  Frank E Harrell Jr wrote:
  Michal Figurski wrote:
  Frank,
 
  How does bootstrap improve on that?
 
  I don't know, but I have an idea. Since the data in my set
  are just a
  small sample of a big population, then if I use my whole
  dataset to
  obtain max likelihood estimates, these estimates may be
  best for this
  dataset, but far from ideal for the whole population.
  The bootstrap, being a resampling procedure from your
  sample, has the
  same issues about the population as MLEs.
 
  I used bootstrap to virtually increase the size of my 
 dataset, it 
  should result in estimates more close to that from the
  population -
  isn't it the purpose of bootstrap?
  No
 
  When I use such median coefficients on another dataset (another 
  sample from population), the predictions are better, than
  using max
  likelihood estimates. I have already tested that and it worked!
  Then your testing procedure is probably not valid.
 
  I am not a statistician and I don't feel what
  overfitting is, but
  it may be just another word for the same idea.
 
  Nevertheless, I would still like to know how can I get the 
  coeffcients for the model that gives the nearly unbiased
  estimates.
  I greatly appreciate your help.
  More info in my book Regression Modeling Strategies.
 
  Frank
 
  --
  Michal J. Figurski
  HUP, Pathology & Laboratory Medicine Xenobiotics Toxicokinetics 
  Research Laboratory 3400 Spruce St. 7 Maloney Philadelphia, PA 
  19104 tel. (215) 662-3413
 
  Frank E Harrell Jr wrote:
  Michal Figurski wrote:
  Hello all,
 
  I am trying to optimize my logistic regression model by using 
  bootstrap. I was previously using SAS for this kind of
  tasks, but I
  am now switching to R.
 
  My data frame consists of 5 columns and has 109 rows. 
  Each row is a
  single record composed of the following values: Subject_name, 
  numeric1, numeric2, numeric3 and outcome (yes or no). 
 All three 
  numerics are used to predict outcome using LR.
 
  In SAS I have written a macro, that was splitting the 
 dataset, 
  running LR on one half of data and making predictions 
 on second 
  half. Then it was collecting the equation 
 coefficients from each 
  iteration of bootstrap. Later I was just taking 
 medians of these 
  coefficients from all iterations, and used them as an
  optimal model
  - it really worked well!
  Why not use maximum likelihood estimation, i.e., the 
 coefficients 
  from the original fit.  How does the bootstrap improve on that?
 
  Now I want to do the same in R. I tried to use

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Michal Figurski

Dear Marc and all,

Thank you for all the due respect.

I tried to explain as much explicitly as I could what I am trying to do 
in my first email. I did not invent this procedure, it was already 
published in the paper:


T. Pawinski, M. Hale, M. Korecka, W.E. Fitzsimmons, L.M. Shaw. Limited 
Sampling Strategy for the Estimation of Mycophenolic Acid Area under the 
Curve in Adult Renal Transplant Patients Treated with Concomitant 
Tacrolimus. Clinical Chemistry 2002;48(9):1497-1504.


I only adapted this methodology to work under SAS, and now I am trying to 
do it under R, because I like R. I need practical advice because I have a 
practical problem, and I do not understand much of the theoretical 
discussion on what the bootstrap is or is not suitable for. Apparently I 
am trying to use it for something other than what the experts are used to...


Honestly, I did not learn anything from this discussion so far, I am 
just disappointed.


Though, since the discussion has already started, I'd welcome your 
criticism on this procedure - I just ask that you express it in human 
language.


--
Michal J. Figurski

Marc Schwartz wrote:

Michal,

With all due respect, you have openly acknowledged that you don't know 
enough about the subject at hand.


If that is the case, on what basis are you in a position to challenge 
the collective wisdom of those professionals who have voluntarily 
offered *expert* level statistical advice to you?


You have erected a wall around your thinking.

You may choose to use R or any other software application to 
Git-R-Done. But that does not make it correct.


There are other methods to consider that could be used during the model 
building process itself, rather than on a post-hoc basis and I would 
specifically refer you to Frank's book, Regression Modeling Strategies:


  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

Marc Schwartz

on 07/22/2008 09:43 AM Michal Figurski wrote:

Hmm...

It sounds like ideology to me. I was asking for technical help. I know 
what I want to do, just don't know how to do it in R. I'll go back to 
SAS then. Thank you.


--
Michal J. Figurski

Doran, Harold wrote:

I think the answer has been given to you. If you want to continue to
ignore that advice and use bootstrap for point estimates rather than the
properties of those estimates (which is what bootstrap is for) then you
are on your own.

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski

Sent: Tuesday, July 22, 2008 9:52 AM
To: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap 
- how to get them?


Dear all,

I don't want to argue with anybody about words or about what 
bootstrap is suitable for - I know too little for that.


All I need is help to get the *equation coefficients* optimized by 
bootstrap - either by one of the functions or by simple median.


Please help,

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
Maloney Philadelphia, PA 19104 tel. (215) 662-3413


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set 

are just a
small sample of a big population, then if I use my whole 

dataset to
obtain max likelihood estimates, these estimates may be 

best for this

dataset, but far from ideal for the whole population.
The bootstrap, being a resampling procedure from your 

sample, has the

same issues about the population as MLEs.

I used bootstrap to virtually increase the size of my dataset, it 
should result in estimates more close to that from the 

population -

isn't it the purpose of bootstrap?

No

When I use such median coefficients on another dataset (another 
sample from population), the predictions are better, than 

using max

likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

I am not a statistician and I don't feel what 

overfitting is, but

it may be just another word for the same idea.

Nevertheless, I would still like to know how can I get the 
coeffcients for the model that gives the nearly unbiased 

estimates.

I greatly appreciate your help.

More info in my book Regression Modeling Strategies.

Frank


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
Maloney Philadelphia, PA 19104 tel. (215) 662-3413


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using 
bootstrap. I was previously using SAS for this kind of 

tasks, but I

am now switching to R.

My data frame consists of 5 columns and has 109 rows. 

Each row is a
single record composed of the following values: Subject_name, 
numeric1, numeric2, numeric3 and outcome (yes or no). All three 
numerics are used to predict outcome using

Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-22 Thread Frank E Harrell Jr

Michal Figurski wrote:

Dear Marc and all,

Thank you for all the due respect.

I tried to explain as much explicitly as I could what I am trying to do 
in my first email. I did not invent this procedure, it was already 
published in the paper:


T. Pawinski, M. Hale, M. Korecka, W.E. Fitzsimmons, L.M. Shaw. Limited 
Sampling Strategy for the Estimation of Mycophenolic Acid Area under the 
Curve in Adult Renal Transplant Patients Treated with Concomitant 
Tacrolimus. Clinical Chemistry 2002;48(9):1497-1504.


If you send me a pdf of this paper I will be glad to take a look.

Rather than an ad hoc bootstrap procedure you might look at the 
resistant/robust fit literature and use an objective function that 
spells out what is being optimized.


There probably are cases where taking the median of a set of bootstrap 
regression coefficient estimates works well in a certain sense, but I 
would put my money on penalized maximum likelihood estimation.
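Frank's suggestion of penalized maximum likelihood can be sketched with lrm from the Design package (today distributed as rms). This is only an illustration under invented data; the variable names are hypothetical, not from the thread:

```r
## Sketch: penalized MLE for logistic regression with lrm/pentrace.
## Data and variable names are made up for illustration.
library(rms)

set.seed(2)
d <- data.frame(x1 = rnorm(109), x2 = rnorm(109), x3 = rnorm(109))
d$y <- rbinom(109, 1, plogis(0.5 * d$x1 - 0.3 * d$x2))

f  <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
pt <- pentrace(f, seq(0, 20, by = 1))   # search over penalty values
fp <- update(f, penalty = pt$penalty)   # refit at the chosen penalty
coef(fp)                                # shrunken coefficients
```

The penalty shrinks the coefficients toward zero, which tends to improve out-of-sample prediction in the same sense the original poster is after, but with an explicit optimization criterion.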


As Marc said, your attitude towards free advice is puzzling.

Frank



I only adapted this methodology to work under SAS, and now I am trying to 
do it under R, because I like R. I need practical advice because I have a 
practical problem, and I do not understand much of the theoretical 
discussion on what the bootstrap is or is not suitable for. Apparently I 
am trying to use it for something other than what the experts are used to...


Honestly, I did not learn anything from this discussion so far, I am 
just disappointed.


Though, since the discussion has already started, I'd welcome your 
criticism on this procedure - I just ask that you express it in human 
language.


--
Michal J. Figurski

Marc Schwartz wrote:

Michal,

With all due respect, you have openly acknowledged that you don't know 
enough about the subject at hand.


If that is the case, on what basis are you in a position to challenge 
the collective wisdom of those professionals who have voluntarily 
offered *expert* level statistical advice to you?


You have erected a wall around your thinking.

You may choose to use R or any other software application to 
Git-R-Done. But that does not make it correct.


There are other methods to consider that could be used during the 
model building process itself, rather than on a post-hoc basis and I 
would specifically refer you to Frank's book, Regression Modeling 
Strategies:


  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

Marc Schwartz

on 07/22/2008 09:43 AM Michal Figurski wrote:

Hmm...

It sounds like ideology to me. I was asking for technical help. I 
know what I want to do, just don't know how to do it in R. I'll go 
back to SAS then. Thank you.


--
Michal J. Figurski

Doran, Harold wrote:

I think the answer has been given to you. If you want to continue to
ignore that advice and use bootstrap for point estimates rather than 
the

properties of those estimates (which is what bootstrap is for) then you
are on your own.

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Michal Figurski

Sent: Tuesday, July 22, 2008 9:52 AM
To: r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap 
- how to get them?


Dear all,

I don't want to argue with anybody about words or about what 
bootstrap is suitable for - I know too little for that.


All I need is help to get the *equation coefficients* optimized by 
bootstrap - either by one of the functions or by simple median.


Please help,

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory 3400 Spruce St. 7 
Maloney Philadelphia, PA 19104 tel. (215) 662-3413


Frank E Harrell Jr wrote:

Michal Figurski wrote:

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set 

are just a
small sample of a big population, then if I use my whole 

dataset to
obtain max likelihood estimates, these estimates may be 

best for this

dataset, but far from ideal for the whole population.
The bootstrap, being a resampling procedure from your 

sample, has the

same issues about the population as MLEs.

I used bootstrap to virtually increase the size of my dataset, it 
should result in estimates more close to that from the 

population -

isn't it the purpose of bootstrap?

No

When I use such median coefficients on another dataset (another 
sample from population), the predictions are better, than 

using max

likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

I am not a statistician and I don't feel what 

overfitting is, but

it may be just another word for the same idea.

Nevertheless, I would still like to know how can I get the 
coeffcients for the model that gives the nearly unbiased 

estimates.

I greatly appreciate your help.

More info in my book Regression Modeling Strategies.

Frank


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research

[R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Michal Figurski

Hello all,

I am trying to optimize my logistic regression model by using bootstrap. 
I was previously using SAS for this kind of tasks, but I am now 
switching to R.


My data frame consists of 5 columns and has 109 rows. Each row is a 
single record composed of the following values: Subject_name, numeric1, 
numeric2, numeric3 and outcome (yes or no). All three numerics are used 
to predict outcome using LR.


In SAS I had written a macro that split the dataset, ran LR on one half of 
the data and made predictions on the second half. It then collected the 
equation coefficients from each iteration of the bootstrap. Later I simply 
took the medians of these coefficients across all iterations, and used 
them as an optimal model - it really worked well!
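The procedure described above can be reconstructed in R roughly as follows. This is my sketch of the described split-half scheme, not the author's SAS macro; the column names match the ones mentioned in the post but the data are invented:

```r
## Sketch: repeated split-half logistic regression, taking the median
## of each coefficient across iterations. Illustrative data only.
set.seed(3)
d <- data.frame(numeric1 = rnorm(109), numeric2 = rnorm(109),
                numeric3 = rnorm(109))
d$outcome <- rbinom(109, 1, plogis(d$numeric1 - d$numeric2))

B <- 500
coefs <- replicate(B, {
  train <- sample(nrow(d), nrow(d) %/% 2)   # random half for fitting
  fit <- glm(outcome ~ numeric1 + numeric2 + numeric3,
             family = binomial, data = d[train, ])
  coef(fit)
})
apply(coefs, 1, median)   # the "median model" across the splits
```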


Now I want to do the same in R. I tried to use the 'validate' or 
'calibrate' functions from package Design, and I also experimented 
with function 'sm.binomial.bootstrap' from package sm. I tried also 
the function 'boot' from package boot, though without success - in my 
case it randomly selected _columns_ from my data frame, while I wanted 
it to select _rows_.
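The row/column confusion with boot() comes from the statistic function's second argument: boot passes a vector of row indices, and the statistic must subset rows itself with `data[i, ]`. A minimal sketch with hypothetical data and names:

```r
## Sketch: with boot(), the statistic receives the data frame and a
## vector of ROW indices -- subset with d[i, ] so rows are resampled,
## not columns. Data and names are illustrative only.
library(boot)

set.seed(4)
d <- data.frame(numeric1 = rnorm(109), numeric2 = rnorm(109),
                numeric3 = rnorm(109))
d$outcome <- rbinom(109, 1, plogis(d$numeric1))

coef_fun <- function(data, i) {
  coef(glm(outcome ~ numeric1 + numeric2 + numeric3,
           family = binomial, data = data[i, ]))   # rows, not columns
}

b <- boot(d, coef_fun, R = 200)
head(b$t)   # each row holds one bootstrap replicate of the coefficients
```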


Though the main point here is the optimized LR equation. I would 
appreciate any help on how to extract the LR equation coefficients from 
any of these bootstrap functions, in the same form as given by 'glm' or 
'lrm'.


Many thanks in advance!

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Frank E Harrell Jr

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using bootstrap. 
I was previously using SAS for this kind of tasks, but I am now 
switching to R.


My data frame consists of 5 columns and has 109 rows. Each row is a 
single record composed of the following values: Subject_name, numeric1, 
numeric2, numeric3 and outcome (yes or no). All three numerics are used 
to predict outcome using LR.


In SAS I have written a macro, that was splitting the dataset, running 
LR on one half of data and making predictions on second half. Then it 
was collecting the equation coefficients from each iteration of 
bootstrap. Later I was just taking medians of these coefficients from 
all iterations, and used them as an optimal model - it really worked well!


Why not use maximum likelihood estimation, i.e., the coefficients from 
the original fit?  How does the bootstrap improve on that?




Now I want to do the same in R. I tried to use the 'validate' or 
'calibrate' functions from package Design, and I also experimented 
with function 'sm.binomial.bootstrap' from package sm. I tried also 
the function 'boot' from package boot, though without success - in my 
case it randomly selected _columns_ from my data frame, while I wanted 
it to select _rows_.


validate and calibrate in Design do resampling on the rows

Resampling is mainly used to get a nearly unbiased estimate of the model 
performance, i.e., to correct for overfitting.
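Frank's suggestion can be sketched with lrm and validate from the Design package (now distributed as rms), which resample rows and report an overfitting-corrected index of performance. Illustrative data only; the variable names echo the poster's but are invented:

```r
## Sketch: bias-corrected (optimism-corrected) validation of a
## logistic model with lrm + validate. Data are made up.
library(rms)

set.seed(6)
d <- data.frame(numeric1 = rnorm(109), numeric2 = rnorm(109),
                numeric3 = rnorm(109))
d$outcome <- rbinom(109, 1, plogis(d$numeric1 - 0.5 * d$numeric2))

f <- lrm(outcome ~ numeric1 + numeric2 + numeric3,
         data = d, x = TRUE, y = TRUE)   # x, y needed for resampling
validate(f, B = 200)   # compare apparent vs. corrected indexes (e.g. Dxy)
```

Note that this corrects the performance estimate, not the coefficients: the model reported is still the original maximum likelihood fit.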


Frank Harrell



Though the main point here is the optimized LR equation. I would 
appreciate any help on how to extract the LR equation coefficients from 
any of these bootstrap functions, in the same form as given by 'glm' or 
'lrm'.


Many thanks in advance!




--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Michal Figurski

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set are just a 
small sample of a big population, if I use my whole dataset to obtain 
maximum likelihood estimates, these estimates may be best for this 
dataset, but far from ideal for the whole population.


I used the bootstrap to virtually increase the size of my dataset; it 
should result in estimates closer to those for the population - isn't 
that the purpose of the bootstrap?


When I use such median coefficients on another dataset (another sample 
from population), the predictions are better, than using max likelihood 
estimates. I have already tested that and it worked!


I am not a statistician and I don't "feel" what "overfitting" is, but it 
may be just another word for the same idea.


Nevertheless, I would still like to know how I can get the coefficients 
for the model that gives the nearly unbiased estimates. I greatly 
appreciate your help.


--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Frank E Harrell Jr wrote:

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using 
bootstrap. I was previously using SAS for this kind of tasks, but I am 
now switching to R.


My data frame consists of 5 columns and has 109 rows. Each row is a 
single record composed of the following values: Subject_name, 
numeric1, numeric2, numeric3 and outcome (yes or no). All three 
numerics are used to predict outcome using LR.


In SAS I had written a macro that split the dataset, ran LR on one half 
of the data, and made predictions on the second half. It then collected 
the equation coefficients from each iteration of the bootstrap. Finally 
I took the medians of these coefficients across all iterations and used 
them as an optimal model - it really worked well!
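
[For illustration, a minimal R sketch of the procedure described above, 
using a simulated stand-in for the 109-row data frame (the column names 
numeric1-3 and outcome are taken from the post; nothing here is the 
author's actual SAS code):]

```r
set.seed(42)
# simulated stand-in for the 109-row data frame described above
d <- data.frame(numeric1 = rnorm(109), numeric2 = rnorm(109),
                numeric3 = rnorm(109))
d$outcome <- rbinom(109, 1, plogis(d$numeric1))

B <- 200
coefs <- t(replicate(B, {
  half <- sample(nrow(d), nrow(d) %/% 2)   # random half-split
  coef(glm(outcome ~ numeric1 + numeric2 + numeric3,
           data = d[half, ], family = binomial))
}))
apply(coefs, 2, median)  # the "median model" from the procedure above
```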


Why not use maximum likelihood estimation, i.e., the coefficients from 
the original fit? How does the bootstrap improve on that?




Now I want to do the same in R. I tried to use the 'validate' and 
'calibrate' functions from package Design, and I also experimented 
with the function 'sm.binomial.bootstrap' from package sm. I also tried 
the function 'boot' from package boot, though without success - in 
my case it randomly selected _columns_ from my data frame, while I 
wanted it to select _rows_.
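
[For reference, boot() resamples whatever the statistic function 
indexes, so passing a statistic that subsets rows gives the desired 
behaviour; a minimal sketch, with `d` standing for the data frame and 
the column names assumed from the post:]

```r
library(boot)

# the statistic refits the model on the ROWS selected by 'idx'
coef_fun <- function(data, idx) {
  coef(glm(outcome ~ numeric1 + numeric2 + numeric3,
           data = data[idx, ], family = binomial))
}

# b <- boot(d, coef_fun, R = 999)  # d: the 109-row data frame
# b$t                              # R x 4 matrix of bootstrap coefficients
```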


validate and calibrate in Design do resampling on the rows

Resampling is mainly used to get a nearly unbiased estimate of the model 
performance, i.e., to correct for overfitting.
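
[As an illustration of that use, a sketch with the Design package 
(since superseded by rms, which keeps the same function names); the 
model formula and data frame `d` are assumed from the earlier post:]

```r
library(Design)   # modern equivalent: library(rms)

# fit <- lrm(outcome ~ numeric1 + numeric2 + numeric3, data = d,
#            x = TRUE, y = TRUE)   # keep X and y so resampling can refit
# validate(fit, B = 200)           # optimism-corrected Dxy, R2, slope, ...
```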


Frank Harrell



Though the main point here is the optimized LR equation. I would 
appreciate any help on how to extract the LR equation coefficients 
from any of these bootstrap functions, in the same form as given by 
'glm' or 'lrm'.


Many thanks in advance!






__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Doran, Harold
 I used bootstrap to virtually increase the size of my 
 dataset, it should result in estimates more close to that 
 from the population - isn't it the purpose of bootstrap?

No, not really. The bootstrap is a resampling method for variance
estimation. It is often used when there is not an easy way, or a closed
form expression, for estimating the sampling variance of a statistic.
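
[A minimal illustration of that use - a bootstrap standard error for a 
logistic regression slope; the data here are simulated for the example:]

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))

# resample rows, refit, and summarize the spread of the refitted slope
boot_slopes <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y[idx] ~ x[idx], family = binomial))[2]
})
sd(boot_slopes)  # bootstrap estimate of the slope's standard error
```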



Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread 刘杰
Hi Doran,

Maybe I am wrong, but I think the bootstrap is a general resampling 
method which can be used for different purposes... Usually it works well 
when you do not have a representative sample set (maybe with a limited 
number of samples). Therefore, I side with Michal...

P.S.: overfitting, in my opinion, describes a model which is quite 
specific to the training dataset but cannot be generalized to new 
samples.

Thanks,

--Jerry
2008/7/21 Doran, Harold [EMAIL PROTECTED]:

  I used bootstrap to virtually increase the size of my
  dataset, it should result in estimates more close to that
  from the population - isn't it the purpose of bootstrap?

 No, not really. The bootstrap is a resampling method for variance
 estimation. It is often used when there is not an easy way, or a closed
 form expression, for estimating the sampling variance of a statistic.






Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Doran, Harold
Well, here is a good source: Wikipedia.
 
http://en.wikipedia.org/wiki/Bootstrapping_(statistics)




From: 刘杰 [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 21, 2008 3:56 PM
To: Doran, Harold
Cc: Michal Figurski; Frank E Harrell Jr; r-help@r-project.org
Subject: Re: [R] Coefficients of Logistic Regression from bootstrap - 
how to get them?


Hi Doran,
 
Maybe I am wrong, but I think the bootstrap is a general resampling method 
which can be used for different purposes...Usually it works well when you do 
not have a representative sample set (maybe with a limited number of samples). 
Therefore, I side with Michal...
 
P.S.: overfitting, in my opinion, describes a model which is quite 
specific to the training dataset but cannot be generalized to new 
samples.
 
Thanks,
 
--Jerry

2008/7/21 Doran, Harold [EMAIL PROTECTED]:


 I used bootstrap to virtually increase the size of my
 dataset, it should result in estimates more close to that
 from the population - isn't it the purpose of bootstrap?


No, not really. The bootstrap is a resampling method for 
variance
estimation. It is often used when there is not an easy way, or 
a closed
form expression, for estimating the sampling variance of a 
statistic.









Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-21 Thread Frank E Harrell Jr

Michal Figurski wrote:

Frank,

How does bootstrap improve on that?

I don't know, but I have an idea. Since the data in my set are just a 
small sample of a large population, the maximum likelihood estimates 
obtained from my whole dataset may be best for this dataset, but far 
from ideal for the whole population.


The bootstrap, being a resampling procedure from your sample, has the 
same issues about the population as MLEs.




I used the bootstrap to virtually increase the size of my dataset; it 
should result in estimates closer to those from the population - isn't 
that the purpose of the bootstrap?


No



When I use such median coefficients on another dataset (another sample 
from the population), the predictions are better than with the maximum 
likelihood estimates. I have already tested that and it worked!


Then your testing procedure is probably not valid.



I am not a statistician and I don't have a good feel for what 
overfitting is, but it may be just another word for the same idea.


Nevertheless, I would still like to know how I can get the coefficients 
for the model that gives the nearly unbiased estimates. I greatly 
appreciate your help.


More info in my book Regression Modeling Strategies.

Frank



--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Frank E Harrell Jr wrote:

Michal Figurski wrote:

Hello all,

I am trying to optimize my logistic regression model by using the 
bootstrap. I was previously using SAS for this kind of task, but I 
am now switching to R.


My data frame consists of 5 columns and has 109 rows. Each row is a 
single record composed of the following values: Subject_name, 
numeric1, numeric2, numeric3 and outcome (yes or no). All three 
numerics are used to predict outcome using LR.


In SAS I had written a macro that split the dataset, ran LR on one 
half of the data, and made predictions on the second half. It then 
collected the equation coefficients from each iteration of the 
bootstrap. Finally I took the medians of these coefficients across all 
iterations and used them as an optimal model - it really worked well!


Why not use maximum likelihood estimation, i.e., the coefficients from 
the original fit? How does the bootstrap improve on that?




Now I want to do the same in R. I tried to use the 'validate' and 
'calibrate' functions from package Design, and I also experimented 
with the function 'sm.binomial.bootstrap' from package sm. I also tried 
the function 'boot' from package boot, though without success - in 
my case it randomly selected _columns_ from my data frame, while I 
wanted it to select _rows_.


validate and calibrate in Design do resampling on the rows

Resampling is mainly used to get a nearly unbiased estimate of the 
model performance, i.e., to correct for overfitting.


Frank Harrell



Though the main point here is the optimized LR equation. I would 
appreciate any help on how to extract the LR equation coefficients 
from any of these bootstrap functions, in the same form as given by 
'glm' or 'lrm'.


Many thanks in advance!









--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University
