RE: [R] an off-topic question -> model validation

2004-11-12 Thread bogdan romocea
Assuming you have enough data, 1/4 to 1/2 of it is usually held out
for validation.
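
A minimal sketch of such a split, in Python using only the standard library. The function name split_data, the default fraction of 1/4, and the fixed seed are illustrative assumptions, not taken from any particular package:

```python
import random


def split_data(rows, validation_fraction=0.25, seed=0):
    """Randomly split rows into (modeling, validation) lists.

    validation_fraction is the share of rows held out for validation;
    values between 0.25 and 0.5 match the rule of thumb above.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)              # random order avoids any ordering in the table
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical data: 100 row indices standing in for the data table.
modeling, validation = split_data(range(100), validation_fraction=0.25)
print(len(modeling), len(validation))
```

Each candidate model would be fitted on the modeling part only and then compared on the validation part.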

One reference would be
Picard, R.R. and Berk, K.N. (1990)
"Data Splitting," The American Statistician, 44, 140-147.

hth,
b.

-----Original Message-----
From: Wensui Liu [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 10:20 PM
To: [EMAIL PROTECTED]
Subject: [R] an off-topic question -> model validation


Currently, I am working on a data mining project and plan to divide
the data table into 2 parts, one for modeling and the other for
validation to compare several models.

But I am not sure about the percentage of data I should use to build
the model and the one I should keep to validate the model.

Is there any literature reference about this topic? 

Thank you so much!

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html



Re: [R] an off-topic question -> model validation

2004-11-11 Thread Frank E Harrell Jr
Wensui Liu wrote:

Currently, I am working on a data mining project and plan to divide
the data table into 2 parts, one for modeling and the other for
validation to compare several models.

But I am not sure about the percentage of data I should use to build
the model and the one I should keep to validate the model.

Is there any literature reference about this topic?

Thank you so much!

Data splitting is very inefficient for model validation unless the
sample size is extremely large.  Consider using Efron's "optimism"
bootstrap, as implemented in the validate function in the Design
package.  That said, validate can also do data splitting and
cross-validation.
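
To make the idea concrete, here is a rough sketch of the optimism bootstrap in Python, using only the standard library. It is not the code behind validate; the one-predictor least-squares model, the mean squared error measure, the helper names, and the simulated data are all assumptions chosen for illustration:

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0   # fall back to a flat line if x is constant
    return mean_y - b * mean_x, b


def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given data."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def optimism_corrected_mse(xs, ys, n_boot=200, seed=0):
    """Return (apparent MSE, optimism-corrected MSE)."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent error: the model is fitted and evaluated on the same data.
    apparent = mse(fit_line(xs, ys), xs, ys)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism of one replicate: error of the bootstrap model on the
        # original data minus its error on its own bootstrap sample.
        optimism += mse(model, xs, ys) - mse(model, bx, by)
    optimism /= n_boot
    # For an error measure, the estimated optimism is added to the apparent error.
    return apparent, apparent + optimism


# Simulated data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(1)
xs = [i / 10 for i in range(30)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
apparent, corrected = optimism_corrected_mse(xs, ys)
print("apparent MSE:", apparent)
print("optimism-corrected MSE:", corrected)
```

Unlike data splitting, every observation is used both to fit the model and to estimate how much the apparent error understates the error on new data.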

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University