Professor Harrell,
Thanks for your lightning-fast reply. It was extremely helpful, and
pointed me exactly to where I needed to go to solve my problem.
For others reading, my problem was that I was incorrectly dealing with
the validation data.
I tried to do this to generate predictions from a pre-existing model
with new data:
lrm(formula = logit.lrm,data=validationData)
However, as Prof. Harrell kindly implied, I actually needed to use the
original model to predict the probabilities for the new data set:
logit.lrm.validationPredictions -
predict(logit.lrm,newdata=validationData,type=\fitted.ind\)
From there, I could cobble together a dataframe of the actual results
in the new dataset with the predicted probabilities based on the
model, and regress from there. This allowed me to generate my
statistic of interest (the C-index).
Again, thank you,
Kyle
On Thu, Jul 16, 2009 at 9:18 PM, Frank E Harrell
Jrf.harr...@vanderbilt.edu wrote:
Kyle Werner wrote:
Does anyone know how to get the C-index from a logistic model - not using
the dataset that was used to train the model, but instead using a fresh
dataset on the same model?
I have a dataset of 400 points that I've split into two halves, one for
training the logistic model, and the other for evaluating it. The
structure
is as follows:
Kyle - I would not trust data splitting with N 20,000.
column headers are got a loan (dichotomous), hourly income
(continuous),
and owns own home (dichotomous)
The training data is
*trainingData[1,] = c(0,12,0)*
*...*
etc
and the validation data is
*validationData[1,] = c(1,35,1)*
*...*
etc
I use Prof. Harrell's excellent Design modules to perform a logistic
regression on the training data like so:
*logit.lrm - lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)*
*lrm(formula = logit.lrm)$stats[6]*
(output is C 0.8739827 - i.e., just the C-index)
**
I really like the ability to extract the C-index (or ROC AUC), because
this
is a factor that I find very helpful in comparing various models. However,
I
don't really want to get that from the data that the model was built on.
Using that C-statistic would be cheating, in a sense, since I'm just
testing
the model on the data it was built against. I would rather get the
C-statistic by applying the model I just generated to the other half of
the
data that I saved.
I have tried doing this:
*lrm(formula = logit.lrm,data=validationData)*
However, this actually generates a new model (giving different
coefficients
to the variables). It doesn't simply apply the new data to the model from
*
logit.lrm* that I generated before.
If you are just fitting a new model with the only predictor being the
predicted log odds, it is true you will get a new slope and intercept, but
this will not affect the c-index. So you can trust the output (for the
c-index and other rank measures such as Dxy, tau, gamma).
Or use rcorr.cens(predict(fit, newdata), newdata$y) and use Dxy=2*(C-.5).
You can use somers2( ) if you don't need the standard error.
Frank
So, can someone point me in the right direction for evaluating the model
that I built with trainingData, but getting the C-statistic against my
validationData?
Thanks so much,
Kyle Werner
[[alternative HTML version deleted]]
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.