Re: [R] randomForest gives different results for formula call v. x, y methods. Why?
On Sat, 2007-04-28 at 21:13 -0400, David L. Van Brunt, Ph.D. wrote:
> Just out of curiosity, I took the default iris example in the RF
> helpfile... but seeing the admonition against using the formula
> interface for large data sets, I wanted to play around a bit to see
> how the various options affected the output. Found something
> interesting I couldn't find documentation for...
>
> Just like the example...
>
> set.seed(12) # to be sure I have reproducibility

No differences between runs for me on FC4 using R 2.4.1 and 2.5.0 with:

    require(randomForest)
    Loading required package: randomForest
    randomForest 4.5-18

*if* I reset the seed before each call to randomForest.

Your example code doesn't seem to be resetting the random seed before
each run. As such, each run starts from a different state of the random
number generator, so it draws different bootstrap samples and tries
different variables at each split.

E.g. the runs are all the same with the seed reset:

    set.seed(12)
    randomForest(Species ~ ., data=iris)

    Call:
     randomForest(formula = Species ~ ., data = iris)
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          3        47        0.06

    set.seed(12)
    randomForest(x=iris[,1:4],y=iris[,5])

    Call:
     randomForest(x = iris[, 1:4], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          3        47        0.06

    set.seed(12)
    randomForest(x=iris[,c(1:4)],y=iris[,5])

    Call:
     randomForest(x = iris[, c(1:4)], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          3        47        0.06

    set.seed(12)
    randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])

    Call:
     randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          3        47        0.06

HTH

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                 [t] +44 (0)20 7679 0522
ECRC                          [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building              [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK                    [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT                      [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
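The point about resetting the seed can be seen without randomForest at all. The following is a minimal sketch, not code from the thread: `sample()` merely stands in for the bootstrap draw a forest makes, and the variable names (`first`, `second`, `again`, `same_after_reset`, `same_without_reset`) are made up for illustration. The last part repeats the comparison with the formula and x/y interfaces, assuming the randomForest package is installed and that the fitted object exposes its confusion matrix as `$confusion`.

```r
# Base R only: every random draw advances the RNG state.
set.seed(12)
first  <- sample(150, replace = TRUE)  # stands in for one bootstrap sample
second <- sample(150, replace = TRUE)  # drawn from the advanced RNG state

set.seed(12)                           # reset the seed before drawing again
again  <- sample(150, replace = TRUE)

same_after_reset   <- identical(first, again)   # same seed, same draw
same_without_reset <- identical(first, second)  # seed not reset in between

print(same_after_reset)
print(same_without_reset)

# The same pattern with the two randomForest interfaces, run only if the
# package is available. The seed is reset before each fit.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(12)
  form.rf <- randomForest::randomForest(Species ~ ., data = iris)
  set.seed(12)
  long.rf <- randomForest::randomForest(x = iris[, 1:4], y = iris[, 5])
  # Compare the confusion matrices of the two fits.
  print(all.equal(form.rf$confusion, long.rf$confusion))
}
```

Calling `set.seed()` once at the top of a script makes the script as a whole repeatable, but successive fits inside it still start from different RNG states, which is what the original example ran into.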
Re: [R] randomForest gives different results for formula call v. x, y methods. Why?
In the words of Simpson (2007), D'OH!

I knew it had to be something simple!

--
David L. Van Brunt, Ph.D.

If Tyranny and Oppression come to this land, it will be in the guise of
fighting a foreign enemy.
    --James Madison
[R] randomForest gives different results for formula call v. x, y methods. Why?
Just out of curiosity, I took the default iris example in the RF
helpfile... but seeing the admonition against using the formula
interface for large data sets, I wanted to play around a bit to see how
the various options affected the output. Found something interesting I
couldn't find documentation for...

Just like the example...

    set.seed(12) # to be sure I have reproducibility
    form.rf <- randomForest(Species ~ ., data=iris)
    form.rf

    Call:
     randomForest(formula = Species ~ ., data = iris)
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4.67%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          4        46        0.08

    long.rf <- randomForest(x=iris[,1:4],y=iris[,5])
    long.rf

    Call:
     randomForest(x = iris[, 1:4], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          3        47        0.06

(Now, if I had non-contiguous columns for predictors, I'd have to call
it this way)

    long2.rf <- randomForest(x=iris[,c(1:4)],y=iris[,5])
    long2.rf

    Call:
     randomForest(x = iris[, c(1:4)], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 5.33%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          5        45        0.10

Any idea why these two should give different results? I can only figure
that the seed, even though it's set, somehow gets altered by the use of
a formula.

    long3.rf <- randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])
    long3.rf

    Call:
     randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 2

            OOB estimate of error rate: 4.67%
    Confusion matrix:
               setosa versicolor virginica class.error
    setosa         50          0         0        0.00
    versicolor      0         47         3        0.06
    virginica       0          4        46        0.08

Either that or I'm calling it wrong in the long example, or else there's
a bug. Not a life-threatening situation, but I am curious as to the
mechanics of this. I use that sort of column identification all the time
and it seems to work OK, but here I get different results (form.rf v.
long.rf or long2.rf) or not (form.rf v. long3.rf) depending on how I
call the function.

Any insights?

--
David L. Van Brunt, Ph.D.