Re: [R] Getting the C-index for a dataset that was not used to generate the logistic model

2009-07-17 Thread Kyle Werner
Professor Harrell,

Thanks for your lightning-fast reply. It was extremely helpful, and
pointed me exactly to where I needed to go to solve my problem.

For others reading, my problem was that I was incorrectly dealing with
the validation data.

I tried to do this to generate predictions from a pre-existing model
with new data:
lrm(formula = logit.lrm,data=validationData)

However, as Prof. Harrell kindly implied, I actually needed to use the
original model to predict the probabilities for the new data set:
logit.lrm.validationPredictions <-
predict(logit.lrm,newdata=validationData,type="fitted.ind")

From there, I could cobble together a dataframe of the actual results
in the new dataset with the predicted probabilities based on the
model, and regress from there. This allowed me to generate my
statistic of interest (the C-index).
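
For anyone who wants to try the workflow end to end, here is a minimal sketch under some loud assumptions: the data are simulated (not my loan data), base R's glm() stands in for lrm() so that nothing beyond base R is needed, and c_index() is a hypothetical helper written for this example rather than a function from Design or Hmisc.

```r
# Simulated stand-in for the loan data (made-up values, not the poster's data).
set.seed(1)
n <- 400
hourlyIncome <- round(runif(n, 8, 60))
ownsHome     <- rbinom(n, 1, 0.4)
gotALoan     <- rbinom(n, 1, plogis(-4 + 0.1 * hourlyIncome + 1.2 * ownsHome))
loans <- data.frame(gotALoan, hourlyIncome, ownsHome)

trainingData   <- loans[1:200, ]
validationData <- loans[201:400, ]

# Fit on the training half only. glm() is an assumption here, used in place of lrm().
fit <- glm(gotALoan ~ hourlyIncome + ownsHome,
           family = binomial, data = trainingData)

# Apply the existing model to the held-out half; nothing is refitted.
validationPredictions <- predict(fit, newdata = validationData,
                                 type = "response")

# Hypothetical helper: C-index as the probability that a randomly chosen
# event gets a higher prediction than a randomly chosen non-event
# (ties count 1/2), computed with the rank-sum formula.
c_index <- function(pred, y) {
  r  <- rank(pred)          # midranks, so tied predictions are handled
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

c_train <- c_index(fitted(fit), trainingData$gotALoan)              # apparent C
c_valid <- c_index(validationPredictions, validationData$gotALoan)  # held-out C
```

The key point is that predict() is called on the model fitted to trainingData, so c_valid measures how well the original coefficients rank the validation cases.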

Again, thank you,

Kyle



On Thu, Jul 16, 2009 at 9:18 PM, Frank E Harrell Jr <f.harr...@vanderbilt.edu> wrote:
 Kyle Werner wrote:

 Does anyone know how to get the C-index from a logistic model - not using
 the dataset that was used to train the model, but instead using a fresh
 dataset on the same model?

 I have a dataset of 400 points that I've split into two halves, one for
 training the logistic model, and the other for evaluating it. The structure
 is as follows:

 Kyle - I would not trust data splitting with N < 20,000.


 column headers are "got a loan" (dichotomous), "hourly income" (continuous),
 and "owns own home" (dichotomous)
 The training data is
 *trainingData[1,] = c(0,12,0)*
 *...*
 etc

 and the validation data is
 *validationData[1,] = c(1,35,1)*
 *...*
 etc

 I use Prof. Harrell's excellent Design modules to perform a logistic
 regression on the training data like so:
 *logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)*
 *lrm(formula = logit.lrm)$stats[6]*
 (output is C 0.8739827 - i.e., just the C-index)
 I really like the ability to extract the C-index (or ROC AUC), because this
 is a factor that I find very helpful in comparing various models. However, I
 don't really want to get that from the data that the model was built on.
 Using that C-statistic would be cheating, in a sense, since I'm just testing
 the model on the data it was built against. I would rather get the
 C-statistic by applying the model I just generated to the other half of the
 data that I saved.

 I have tried doing this:
 *lrm(formula = logit.lrm,data=validationData)*
 However, this actually generates a new model (giving different coefficients
 to the variables). It doesn't simply apply the new data to the model from
 *logit.lrm* that I generated before.

 If you are just fitting a new model with the only predictor being the
 predicted log odds, it is true you will get a new slope and intercept, but
 this will not affect the c-index.  So you can trust the output (for the
 c-index and other rank measures such as Dxy, tau, gamma).

 Or use rcorr.cens(predict(fit, newdata), newdata$y) and use Dxy=2*(C-.5).
  You can use somers2( ) if you don't need the standard error.

 Frank


 So, can someone point me in the right direction for evaluating the model
 that I built with trainingData, but getting the C-statistic against my
 validationData?

 Thanks so much,

 Kyle Werner


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 --
 Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University




Re: [R] Getting the C-index for a dataset that was not used to generate the logistic model

2009-07-16 Thread Frank E Harrell Jr

Kyle Werner wrote:

Does anyone know how to get the C-index from a logistic model - not using
the dataset that was used to train the model, but instead using a fresh
dataset on the same model?

I have a dataset of 400 points that I've split into two halves, one for
training the logistic model, and the other for evaluating it. The structure
is as follows:


Kyle - I would not trust data splitting with N < 20,000.



column headers are "got a loan" (dichotomous), "hourly income" (continuous),
and "owns own home" (dichotomous)
The training data is
*trainingData[1,] = c(0,12,0)*
*...*
etc

and the validation data is
*validationData[1,] = c(1,35,1)*
*...*
etc

I use Prof. Harrell's excellent Design modules to perform a logistic
regression on the training data like so:
*logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)*
*lrm(formula = logit.lrm)$stats[6]*
(output is C 0.8739827 - i.e., just the C-index)
I really like the ability to extract the C-index (or ROC AUC), because this
is a factor that I find very helpful in comparing various models. However, I
don't really want to get that from the data that the model was built on.
Using that C-statistic would be cheating, in a sense, since I'm just testing
the model on the data it was built against. I would rather get the
C-statistic by applying the model I just generated to the other half of the
data that I saved.

I have tried doing this:
*lrm(formula = logit.lrm,data=validationData)*
However, this actually generates a new model (giving different coefficients
to the variables). It doesn't simply apply the new data to the model from
*logit.lrm* that I generated before.


If you are just fitting a new model with the only predictor being the 
predicted log odds, it is true you will get a new slope and intercept, 
but this will not affect the c-index.  So you can trust the output (for 
the c-index and other rank measures such as Dxy, tau, gamma).
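
A minimal sketch of why the refit leaves the c-index unchanged, assuming simulated data and base R's glm() in place of lrm(); c_index() is a hypothetical helper for this example, not a Design or Hmisc function. The refitted probabilities are a strictly increasing function of the predicted log odds whenever the new slope is positive, so the ordering of the cases, and hence every rank measure, is the same.

```r
set.seed(2)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
train <- data.frame(x = x[1:200],   y = y[1:200])
valid <- data.frame(x = x[201:400], y = y[201:400])

# Hypothetical helper: rank-sum form of the c-index (ties count 1/2).
c_index <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Original model, fitted on the training half only.
fit <- glm(y ~ x, family = binomial, data = train)

# Predicted log odds for the validation half (default type = "link").
valid$lp <- predict(fit, newdata = valid)

# New model on the validation half with the predicted log odds as the
# only predictor: this re-estimates an intercept and a slope, nothing else.
recal <- glm(y ~ lp, family = binomial, data = valid)

c_original <- c_index(valid$lp, valid$y)       # ranks from the original model
c_refit    <- c_index(fitted(recal), valid$y)  # ranks after the refit
slope      <- unname(coef(recal)["lp"])        # positive here, so order is kept
```

If the refitted slope were negative the ordering would reverse and the c-index would become 1 - C, so it is worth checking the sign of the slope before trusting the output.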


Or use rcorr.cens(predict(fit, newdata), newdata$y) and use 
Dxy=2*(C-.5).  You can use somers2( ) if you don't need the standard error.
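
A minimal sketch of those two calls, assuming the Hmisc package is installed; pred and y are simulated stand-ins for predict(fit, newdata) and newdata$y, not real data. The calls are wrapped in a requireNamespace() check so the script still runs where Hmisc is absent.

```r
set.seed(3)
pred <- runif(100)             # stand-in for predict(fit, newdata)
y    <- rbinom(100, 1, pred)   # stand-in for newdata$y (0/1 outcome)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # somers2(): C and Dxy without a standard error.
  s <- Hmisc::somers2(pred, y)

  # rcorr.cens(): also reports "S.D.", the standard deviation of Dxy.
  # A plain vector as the second argument is treated as uncensored.
  r <- Hmisc::rcorr.cens(pred, y)

  # Dxy and C carry the same information: Dxy = 2 * (C - 0.5).
  dxy_from_c <- 2 * (unname(s["C"]) - 0.5)
  print(s)
  print(r[c("C Index", "Dxy", "S.D.")])
}
```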


Frank



So, can someone point me in the right direction for evaluating the model
that I built with trainingData, but getting the C-statistic against my
validationData?

Thanks so much,

Kyle Werner





--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University
