Re: [R] Random Forest Cross Validation

2011-02-27 Thread ronzhao
Thanks to you all!

Now I got it!



Re: [R] Random Forest Cross Validation

2011-02-24 Thread Liaw, Andy
Exactly as Max said.  See the rfcv() function in the latest version of 
randomForest, as well as the reference in the help page for that function.

The OOB estimate is as accurate as a CV estimate _if_ you run straight RF.  Most
other methods do not have this feature.  However, if you start adding steps
such as feature selection, all bets are off.
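
A minimal sketch of a call to rfcv(), assuming a predictor matrix x and a
class factor y (both hypothetical placeholders):

library(randomForest)
set.seed(42)
# cross-validated error over nested subsets of predictors,
# with the ranking redone inside each fold
cv <- rfcv(trainx = x, trainy = y, cv.fold = 5, step = 0.5)
# error.cv is the CV error at each number of retained variables (n.var)
with(cv, plot(n.var, error.cv, log = "x", type = "o",
              xlab = "Number of variables", ylab = "CV error rate"))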

Andy 



Re: [R] Random Forest Cross Validation

2011-02-22 Thread ronzhao

Thanks, Max.

Yes, I did some feature selection in the training set. Basically, I
selected the top 1000 SNPs based on OOB error and grew the forest using the
training set, then used the test set to validate the forest.

But if I did the same thing in the test set, the top SNPs would be different
from those in the training set. That may be difficult to interpret.
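
For concreteness, a rough sketch of that workflow (hypothetical objects:
snp = genotype matrix, pheno = case/control factor; AUC via the pROC
package). As the replies point out, the OOB error of rf1 below is
optimistic, because the SNPs were pre-selected on the same training data;
only the held-out 30% gives an honest number:

library(randomForest)
library(pROC)
set.seed(1)
tr <- sample(nrow(snp), floor(0.7 * nrow(snp)))
# rank SNPs by importance on the training set only, keep the top 1000
rf0 <- randomForest(snp[tr, ], pheno[tr], ntree = 500)
top <- order(importance(rf0)[, 1], decreasing = TRUE)[1:1000]
# refit on the selected SNPs; its OOB error will look too good
rf1 <- randomForest(snp[tr, top], pheno[tr], ntree = 500)
# honest check: AUC on the held-out 30%
prob <- predict(rf1, snp[-tr, top], type = "prob")[, 2]
auc(pheno[-tr], prob)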






Re: [R] Random Forest Cross Validation

2011-02-22 Thread mxkuhn
If you want to get honest estimates of accuracy, you should repeat the feature
selection within the resampling (not the test set). You will get different
lists each time, but that's the point. Right now you are not capturing that
uncertainty, which is why the OOB and test set results differ so much.

The list you get in the original training set is still the real list. The
resampling results help you understand how much you might be overfitting the
*variables*.
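
A small sketch of that idea, redoing the SNP filter inside every fold so the
selection uncertainty shows up in the error estimate (hypothetical snp /
pheno objects again, as above):

library(randomForest)
set.seed(2)
folds <- sample(rep(1:5, length.out = nrow(snp)))
cv_err <- sapply(1:5, function(k) {
  tr <- folds != k
  # the filter is applied inside the fold, on the training portion only
  rf0 <- randomForest(snp[tr, ], pheno[tr], ntree = 500)
  top <- order(importance(rf0)[, 1], decreasing = TRUE)[1:1000]
  rf1 <- randomForest(snp[tr, top], pheno[tr], ntree = 500)
  # error on the held-out fold, which played no part in the selection
  mean(predict(rf1, snp[!tr, top]) != pheno[!tr])
})
mean(cv_err)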

Max




Re: [R] Random Forest Cross Validation

2011-02-20 Thread Max Kuhn
> I am using the randomForest package to do some prediction on GWAS data. I
> first split the data into training and testing sets (70% vs 30%), then
> used the training set to grow the trees (ntree=10). It looks like the OOB
> error in the training set is good (10%). However, it is not very good for
> the test set, with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also
need to include that step in the cross-validation to get realistic
performance estimates (see Ambroise and McLachlan. Selection bias in
gene extraction on the basis of microarray gene-expression data.
Proceedings of the National Academy of Sciences (2002) vol. 99 (10)
pp. 6562-6566).

In the caret package, train() can be used to get cross-validation
estimates for RF and the sbf() function (for selection by filter) can
be used to include simple univariate filters in the CV procedure.
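
A minimal sketch of both calls, assuming a predictor data frame x and an
outcome factor y (placeholders), and using caret's built-in rfSBF functions
for the filter/fit steps:

library(caret)
set.seed(3)
# cross-validated RF performance, no filtering
fit <- train(x, y, method = "rf",
             trControl = trainControl(method = "cv", number = 10))
# univariate filter re-applied inside each resample
filt <- sbf(x, y,
            sbfControl = sbfControl(functions = rfSBF,
                                    method = "cv", number = 10))
fit$results   # resampled accuracy for the straight RF
filt$results  # resampled accuracy with the filter inside the loop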

> Although some people said no cross-validation was necessary for RF, I still
> felt unsafe and thought a testing set was important. I felt really
> frustrated with the results.

CV is needed when you want an assessment of performance on a test set.
In this sense, RF is like any other method.

-- 

Max
