Re: [R] randomForest gives different results for formula call v. x, y methods. Why?

2007-04-29 Thread Gavin Simpson
On Sat, 2007-04-28 at 21:13 -0400, David L. Van Brunt, Ph.D. wrote:
> Just out of curiosity, I took the default iris example in the RF
> helpfile...
> but seeing the admonition against using the formula interface for large data
> sets, I wanted to play around a bit to see how the various options affected
> the output. Found something interesting I couldn't find documentation for...
>
> Just like the example...
> > set.seed(12) # to be sure I have reproducibility

No differences between runs for me on FC4 using R 2.4.1 and 2.5.0 with:

> require(randomForest)
Loading required package: randomForest
randomForest 4.5-18

*if* I reset the seed before each call to randomForest.

Your example code doesn't seem to be resetting the random seed before
each run. As such, each run uses a different stream of random numbers,
so the bootstrap samples (and the variables tried at each split) differ
between runs.
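
As a rough illustration of the point above, using base R only (`draw_bootstrap` is a made-up helper for this sketch, not part of randomForest): the random number stream advances with every draw, so a second bootstrap sample differs from the first unless the seed is reset.

```r
# Hypothetical helper: draw one bootstrap sample of row indices.
draw_bootstrap <- function(n) sample(n, size = n, replace = TRUE)

n <- nrow(iris)

set.seed(12)
first  <- draw_bootstrap(n)
second <- draw_bootstrap(n)   # no reset: the RNG state has moved on

set.seed(12)
again  <- draw_bootstrap(n)   # seed reset, so this repeats `first`

same_without_reset <- identical(first, second)
same_with_reset    <- identical(first, again)
print(c(without_reset = same_without_reset, with_reset = same_with_reset))
```

The same reasoning applies to successive randomForest() calls in one session: only the first call after set.seed() starts from the seeded state.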

E.g. the runs are all the same when the seed is reset before each call:

> set.seed(12)
> randomForest(Species ~ ., data=iris)

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,1:4],y=iris[,5])

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1:4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
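
The comparison above could also be made programmatically rather than by eye. This is only a sketch under a few assumptions: the object names (`x1`, `f_form`, `f_xy`, and so on) are invented here, and the randomForest part is skipped if the package is not installed. It first checks that the three ways of writing `x =` select the same data, then refits with the seed reset before each call and compares the confusion matrices.

```r
# The three column selections used in the thread give the same data frame.
x1 <- iris[, 1:4]
x2 <- iris[, c(1:4)]
x3 <- iris[, c(1, 2, 3, 4)]
same_x <- identical(x1, x2) && identical(x1, x3)
print(same_x)

# Optional part: only runs when randomForest is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(12)   # reset before the formula call
  f_form <- randomForest::randomForest(Species ~ ., data = iris)

  set.seed(12)   # reset again before the x, y call
  f_xy <- randomForest::randomForest(x = x1, y = iris[, 5])

  # Compare the OOB confusion matrices of the two fits.
  print(all.equal(f_form$confusion, f_xy$confusion))
}
```

Since the predictors are the same object in each case, any difference between fits has to come from the state of the random number generator, not from how the columns were indexed.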

HTH

G
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC                          [f] +44 (0)20 7679 0565
 UCL Department of Geography
 Pearson Building              [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street
 London, UK                    [w] http://www.ucl.ac.uk/~ucfagls/
 WC1E 6BT                      [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] randomForest gives different results for formula call v. x, y methods. Why?

2007-04-29 Thread David L. Van Brunt, Ph.D.
In the words of Simpson (2007), D'OH!

I knew it had to be something simple!

-- 
---
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

If Tyranny and Oppression come to this land, it will be in the guise of
fighting a foreign enemy.
--James Madison



[R] randomForest gives different results for formula call v. x, y methods. Why?

2007-04-28 Thread David L. Van Brunt, Ph.D.
Just out of curiosity, I took the default iris example in the RF
helpfile...
but seeing the admonition against using the formula interface for large data
sets, I wanted to play around a bit to see how the various options affected
the output. Found something interesting I couldn't find documentation for...

Just like the example...
> set.seed(12) # to be sure I have reproducibility

> form.rf <- randomForest(Species ~ ., data=iris)
> form.rf

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

> long.rf <- randomForest(x=iris[,1:4],y=iris[,5])
> long.rf

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06


(Now, if I had non-contiguous columns for predictors, I'd have to call it
this way)

> long2.rf <- randomForest(x=iris[,c(1:4)],y=iris[,5])
> long2.rf

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          5        45        0.10


Any idea why these two should give different results? I can only figure that
the seed, even though it's set, somehow gets altered by the use of a
formula.

> long3.rf <- randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])
> long3.rf

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08


Either that or I'm calling it wrong in the long example, or else there's a
bug. Not a life threatening situation, but I am curious as to the mechanics
of this. I use that sort of column identification all the time and it seems
to work OK, but here I get different results (form.rf v. long.rf or long2.rf)
or not (form.rf v. long3.rf)  depending how I call the function. Any
insights?


-- 
---
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

If Tyranny and Oppression come to this land, it will be in the guise of
fighting a foreign enemy.
--James Madison
