Re: [R] Random Forest Cross Validation
Thanks to you all! Now I got it!
Re: [R] Random Forest Cross Validation
Exactly as Max said. See the rfcv() function in the latest version of randomForest, as well as the reference in the help page for that function. The OOB estimate is as accurate as the CV estimate _if_ you run straight RF; most other methods do not have this feature. However, if you start adding steps such as feature selection, all bets are off.

Andy
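For reference, a minimal sketch of rfcv() on made-up data (the SNP matrix, response, and sizes below are hypothetical stand-ins; rfcv() refits the forest on a shrinking sequence of top-ranked predictors and cross-validates each step):

library(randomForest)

## Hypothetical stand-in data: 100 samples, 200 SNP-like predictors
set.seed(42)
x <- matrix(sample(0:2, 100 * 200, replace = TRUE), nrow = 100)
y <- factor(sample(c("case", "control"), 100, replace = TRUE))

## 5-fold CV over decreasing numbers of top-ranked variables
cv <- rfcv(trainx = x, trainy = y, cv.fold = 5)

## CV error as a function of the number of variables kept
with(cv, plot(n.var, error.cv, type = "b", log = "x",
              xlab = "number of variables", ylab = "CV error"))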
Re: [R] Random Forest Cross Validation
Thanks, Max.

Yes, I did some feature selection in the training set. Basically, I selected the top 1000 SNPs based on OOB error and grew the forest using the training set, then used the test set to validate the forest. But if I did the same thing in the test set, the top SNPs would be different from those in the training set. That may be difficult to interpret.
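For concreteness, the workflow described above looks roughly like this (all data objects are hypothetical stand-ins, and variable importance stands in for the exact "based on OOB error" filter in the post; as the replies explain, the OOB error of the final forest is optimistic because the selection step has already seen all of the training samples):

library(randomForest)

## Hypothetical stand-ins for the GWAS data
set.seed(1)
snps  <- matrix(sample(0:2, 300 * 2000, replace = TRUE), nrow = 300)
pheno <- factor(sample(c("case", "control"), 300, replace = TRUE))
colnames(snps) <- paste0("snp", seq_len(ncol(snps)))

## Rank all SNPs once on the training data and keep the top 1000 ...
rf_all <- randomForest(snps, pheno, ntree = 100, importance = TRUE)
top    <- order(importance(rf_all, type = 1), decreasing = TRUE)[1:1000]

## ... then grow a forest on just those SNPs.  This forest's OOB error
## has already "seen" the selection step, hence it looks too good.
rf_top <- randomForest(snps[, top], pheno, ntree = 100)
rf_top$err.rate[nrow(rf_top$err.rate), "OOB"]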
Re: [R] Random Forest Cross Validation
If you want honest estimates of accuracy, you should repeat the feature selection within the resampling (not on the test set). You will get different lists each time, but that's the point. Right now you are not capturing that uncertainty, which is why the OOB and test-set results differ so much. The list you get in the original training set is still the real list. The resampling results help you understand how much you might be overfitting the *variables*.

Max
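As a sketch of what "repeat the feature selection within the resampling" means, here is a plain k-fold loop (data objects are hypothetical stand-ins; the point is that the SNP filter is re-run from scratch inside every fold, so the held-out samples never influence the list):

library(randomForest)

## Hypothetical stand-ins for the GWAS data
set.seed(2)
snps  <- matrix(sample(0:2, 300 * 2000, replace = TRUE), nrow = 300)
pheno <- factor(sample(c("case", "control"), 300, replace = TRUE))
colnames(snps) <- paste0("snp", seq_len(ncol(snps)))

k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(snps)))
cv_err <- numeric(k)

for (i in 1:k) {
  tr <- folds != i
  ## Selection is redone inside the fold, on its training part only
  rf_rank <- randomForest(snps[tr, ], pheno[tr], ntree = 100,
                          importance = TRUE)
  top <- order(importance(rf_rank, type = 1), decreasing = TRUE)[1:1000]
  ## Fit on the selected SNPs and score the held-out fold
  rf_fit <- randomForest(snps[tr, top], pheno[tr], ntree = 100)
  cv_err[i] <- mean(predict(rf_fit, snps[!tr, top]) != pheno[!tr])
}
mean(cv_err)  # the selection step's variability is now captured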
Re: [R] Random Forest Cross Validation
> I am using the randomForest package to do some prediction on GWAS data. I
> first split the data into training and test sets (70% vs. 30%), then used
> the training set to grow the trees (ntree = 10). The OOB error in the
> training set looks good (10%). However, it is not very good for the test
> set, with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also need to include that step in the cross-validation to get realistic performance estimates (see Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," PNAS (2002), vol. 99, no. 10, pp. 6562-6566). In the caret package, train() can be used to get cross-validation estimates for RF, and the sbf() function (selection by filter) can be used to include simple univariate filters in the CV procedure.

> Although some people said no cross-validation is necessary for RF, I still
> felt unsafe and thought a test set was important. I felt really frustrated
> with the results.

CV is needed when you want an assessment of performance on a test set. In this sense, RF is like any other method.

-- Max
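A sketch of those two caret calls on made-up data (object names and sizes are hypothetical; rfSBF is caret's bundled set of random-forest functions for sbf(), which re-applies a univariate filter inside every resample):

library(caret)
## (method = "rf" and rfSBF both require the randomForest package)

## Hypothetical stand-in data
set.seed(3)
x <- as.data.frame(matrix(sample(0:2, 200 * 100, replace = TRUE), nrow = 200))
y <- factor(sample(c("case", "control"), 200, replace = TRUE))

## Cross-validated performance for a plain random forest
rf_cv <- train(x, y, method = "rf",
               trControl = trainControl(method = "cv", number = 5))

## Selection by filter: the univariate screen is repeated within each
## fold, so the selection step is part of what gets validated
rf_sbf <- sbf(x, y,
              sbfControl = sbfControl(functions = rfSBF,
                                      method = "cv", number = 5))
rf_cv
rf_sbf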