[R] Model validation and penalization with rms package

2010-06-28 Thread Mark Seeto
I’ve been using Frank Harrell’s rms package to do bootstrap model
validation. Is it the case that the optimum penalization may still
give a model which is substantially overfitted?

I calculated corrected R^2, optimism in R^2, and corrected slope for
various penalties for a simple example:

x1 - rnorm(45)
x2 - rnorm(45)
x3 - rnorm(45)
y - x1 + 2*x2 + rnorm(45,0,3)

ols0 - ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)

corrected.Rsq - rep(0,60)
optimism.Rsq - rep(0,60)
corrected.slope - rep(0,60)

for (pen in 1:60) {
olspen - ols(y ~ x1 + x2 + x3, penalty=pen, x=TRUE, y=TRUE)
val - validate(olspen, B=200)
corrected.Rsq[pen] - val[R-square, index.corrected]
optimism.Rsq[pen] - val[R-square, optimism]
corrected.slope[pen] - val[Slope, index.corrected]
}
plot(corrected.Rsq)
x11(); plot(optimism.Rsq)
x11(); plot(corrected.slope)
p - pentrace(ols0, penalty=1:60)
ols9 - ols(y ~ x1 + x2 + x3, penalty=9, x=TRUE, y=TRUE)
validate(ols9, B=200)
index.orig  training   test   optimism
index.corrected   n
R-square0.2523404 0.2820734  0.1911878  0.09088563   0.1614548 200
MSE 7.8497722 7.3525300  8.4918212 -1.13929116   8.9890634 200
Intercept   0.000 0.000 -0.1353572  0.13535717  -0.1353572 200
Slope   1.000 1.000  1.1707137 -0.17071372   1.1707137 200

pentrace tells me that of the penalties 1, 2,..., 60, corrected AIC is
maximised by a penalty of 9. This is consistent with the corrected R^2
plot, which shows a maximum somewhere around 10. However, a penalty of
9 still gives an R^2 optimism of 0.09 (training R^2=0.28, test
R^2=0.19), suggesting overfitting.

Do we just have to live with this R^2 optimism? It can be decreased by
taking a bigger penalty, but then the corrected R^2 is reduced.  Also,
a penalty of 9 gives a corrected slope of about 1.17 (corrected slope
of 1 is achieved with a penalty of about 1 or 2).

Thanks for any help/advice you can give.

Mark
--
Mark Seeto
Statistician

National Acoustic Laboratories
A Division of Australian Hearing

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Model validation and penalization with rms package

2010-06-28 Thread Charles C. Berry

On Tue, 29 Jun 2010, Mark Seeto wrote:


I’ve been using Frank Harrell’s rms package to do bootstrap model
validation. Is it the case that the optimum penalization may still
give a model which is substantially overfitted?

I calculated corrected R^2, optimism in R^2, and corrected slope for
various penalties for a simple example:

x1 - rnorm(45)
x2 - rnorm(45)
x3 - rnorm(45)
y - x1 + 2*x2 + rnorm(45,0,3)

ols0 - ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)

corrected.Rsq - rep(0,60)
optimism.Rsq - rep(0,60)
corrected.slope - rep(0,60)

for (pen in 1:60) {
olspen - ols(y ~ x1 + x2 + x3, penalty=pen, x=TRUE, y=TRUE)
val - validate(olspen, B=200)
corrected.Rsq[pen] - val[R-square, index.corrected]
optimism.Rsq[pen] - val[R-square, optimism]
corrected.slope[pen] - val[Slope, index.corrected]
}
plot(corrected.Rsq)
x11(); plot(optimism.Rsq)
x11(); plot(corrected.slope)
p - pentrace(ols0, penalty=1:60)
ols9 - ols(y ~ x1 + x2 + x3, penalty=9, x=TRUE, y=TRUE)
validate(ols9, B=200)
index.orig  training   test   optimism
index.corrected   n
R-square0.2523404 0.2820734  0.1911878  0.09088563   0.1614548 200
MSE 7.8497722 7.3525300  8.4918212 -1.13929116   8.9890634 200
Intercept   0.000 0.000 -0.1353572  0.13535717  -0.1353572 200
Slope   1.000 1.000  1.1707137 -0.17071372   1.1707137 200

pentrace tells me that of the penalties 1, 2,..., 60, corrected AIC is
maximised by a penalty of 9. This is consistent with the corrected R^2
plot, which shows a maximum somewhere around 10. However, a penalty of
9 still gives an R^2 optimism of 0.09 (training R^2=0.28, test
R^2=0.19), suggesting overfitting.

Do we just have to live with this R^2 optimism? It can be decreased by
taking a bigger penalty, but then the corrected R^2 is reduced.  Also,
a penalty of 9 gives a corrected slope of about 1.17 (corrected slope
of 1 is achieved with a penalty of about 1 or 2).



Your best bet, as you are a statistician, is to read up on the 
bias-variance tradeoff (aka dilemma).


I recommend that you focus on Section 5. Bias, variance, and estimation 
error in Friedman's On Bias, Variance, 0/1 Loss, and the 
Curse-of-Dimensionality available online at


http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6820rep=rep1type=pdf


In short, overfitting aka 'optimism' is due to variance, and you can 
temper it by adding bias. But as you add more you lose the signal in the 
data. The resubstituted training data will tend to overstate the 
prediction accuracy unless you penalize so severely that the prediction is 
unrelated to the training data.



HTH,

Chuck



Thanks for any help/advice you can give.

Mark
--
Mark Seeto
Statistician

National Acoustic Laboratories
A Division of Australian Hearing

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Charles C. Berry(858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu   UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.