Re: [R] randomForest speed improvements

2011-01-05 Thread Liaw, Andy
Note that that isn't exactly what I recommended.  If you look at the
example in the help page for combine(), you'll see that it is combining
RF objects trained on the same data; i.e., instead of having one RF with
500 trees, you can combine five RFs trained on the same data with 100
trees each into one 500-tree RF.
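
For illustration, here is a minimal sketch of the same-data approach from the combine() help page. It assumes simulated data: the data frame x, the response y and all object names are made up here and do not come from the thread.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 1000 rows, 6 numeric predictors.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

# Grow five 100-tree forests, all on the same data ...
forests <- lapply(1:5, function(i) randomForest(x, y, ntree = 100))

# ... and merge them into one forest with 5 * 100 trees.
rf_all <- do.call(randomForest::combine, forests)
rf_all$ntree
```

Each of the five fits is independent, so they could just as well run in separate processes. As far as I recall from the combine() help page, the error components (mse and rsq for regression) of the combined object are set to NULL, so out-of-bag error has to be checked some other way.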

The way you are using combine() is basically using sample size to limit
tree size, which you can do by playing with the nodesize argument in
randomForest() as I suggested previously.  Either way is fine as long as
you don't see prediction performance degrading.

Andy


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] randomForest speed improvements

2011-01-05 Thread Liaw, Andy
I should also mention that another way you can do something similar is
by making use of the sampsize argument in randomForest().  For example,
if you call randomForest() with sampsize=500, it will randomly draw 500
data points to grow each tree.  This way you don't even need to run the
RFs separately and combine them.  
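
A small sketch of the sampsize idea, assuming simulated data (x, y and the object names are hypothetical). It only compares the average number of terminal nodes per tree with and without sampsize; it makes no claim about prediction quality.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 2000 rows, 6 numeric predictors.
n <- 2000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

# Default: each tree is grown on a bootstrap sample of all n rows.
rf_full <- randomForest(x, y, ntree = 50)

# sampsize = 500: each tree is grown on 500 randomly drawn rows.
rf_small <- randomForest(x, y, ntree = 50, sampsize = 500)

# Average number of terminal nodes per tree in each forest.
mean(treesize(rf_full))
mean(treesize(rf_small))
```

Smaller per-tree samples give smaller trees, which is where the time saving comes from; whether accuracy holds up has to be checked on your own data.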

Andy




Re: [R] randomForest speed improvements

2011-01-04 Thread Liaw, Andy
If you have multiple cores, one poor man's solution is to run separate
forests in different R sessions, save the RF objects, load them into the
same session and combine() them.  You can do this less clumsily if you
use things like Rmpi or other distributed computing packages.
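
The save/load/combine() workflow could look roughly like the sketch below. It assumes simulated data and a temporary directory; the file names and object names are made up. To stay self-contained, the "separate sessions" are simulated by a loop in one session.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 1000 rows, 6 numeric predictors.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

out_dir <- tempfile("rf_parts_")
dir.create(out_dir)

# In practice this step would run once in each separate R session;
# here the sessions are simulated by a loop.
for (i in 1:4) {
  rf_part <- randomForest(x, y, ntree = 50)
  saveRDS(rf_part, file.path(out_dir, sprintf("rf_%d.rds", i)))
}

# Back in a single session: load the saved forests and merge them.
files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
rf_all <- do.call(randomForest::combine, lapply(files, readRDS))
rf_all$ntree
```

With real separate sessions, each one should use a different random seed; otherwise every session grows the same trees.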

Another consideration is to increase nodesize (which reduces the sizes
of trees).  The problem with numeric predictors for tree-based
algorithms is that the number of candidate split points grows with the
number of distinct values, so the work needed to find the best split
grows with the data, and that cost is paid _at each node_.  Some
algorithms try to save on this by using only certain quantiles.  The
current RF code doesn't do this.
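
A quick sketch of the nodesize effect, assuming simulated data (x, y and the object names are hypothetical). For regression the default nodesize is 5; the value 50 below is an arbitrary choice for illustration.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 1000 rows, 6 numeric predictors.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

# Default nodesize for regression (5) versus a larger minimum node size.
rf_default <- randomForest(x, y, ntree = 50)
rf_coarse  <- randomForest(x, y, ntree = 50, nodesize = 50)

# Average number of terminal nodes per tree in each forest.
mean(treesize(rf_default))
mean(treesize(rf_coarse))
```

A larger nodesize stops splitting earlier, so fewer nodes need a split search. Wrapping each call in system.time() would show the effect on run time for your own data.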

Andy

 



Re: [R] randomForest speed improvements

2011-01-04 Thread apresley

Andy,

Thanks for the reply.  I had no idea I could combine them back ... that
actually will work pretty well.  We can have several worker threads load
up the RFs on different machines and/or cores, and then re-assemble them.
Rmpi might be an option down the road, but would be a bit of overhead for us
now.

Using combine(), I was able to drastically reduce the amount of time it
takes to build randomForest objects.  For example, using about 25,000 rows
(6 columns), it takes maybe 5 minutes on my laptop.  Using 5 randomForest
objects (each with 5k rows), and then combining them, takes  1 minute.
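
A rough sketch of this split-and-combine variant, assuming simulated data (x, y, x_new and the object names are hypothetical). The chunks are equal-sized, mirroring the 5 x 5k rows described above.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 1000 rows, 6 numeric predictors.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

# Split the row indices into 5 equal-sized chunks.
chunks <- split(seq_len(n), rep(1:5, each = n / 5))

# Grow one 100-tree forest per chunk, each seeing only its own rows.
forests <- lapply(chunks, function(idx) {
  randomForest(x[idx, ], y[idx], ntree = 100)
})

# Merge the per-chunk forests and predict on new rows.
rf_all <- do.call(randomForest::combine, unname(forests))
x_new <- data.frame(matrix(runif(10 * 6), ncol = 6))
p <- predict(rf_all, x_new)
```

Each tree here only ever sees one fifth of the data, so it is worth comparing predictions against a forest grown on all rows before relying on the speed-up.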

--
Anthony


[R] randomForest speed improvements

2011-01-03 Thread apresley

Hi there,

We're trying to use randomForest to do some predictions.  The test-harness
for our code is pretty straightforward:

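  library(randomForest)
  data202 <- read.csv("random.csv", header=TRUE)

  # NOTE: the row ranges were truncated in the archive ("1:5", "50001:6");
  # 1:50000 and 50001:60000 are assumed here.
  # Training set: predictors in columns 1-6, target in column 8.
  x <- data202[1:50000, 1:6]
  y <- data202[1:50000, 8]
  y <- y[, drop=TRUE]

  # Held-out rows used for prediction.
  x2 <- data202[50001:60000, 1:6]
  y2 <- data202[50001:60000, 8]
  y2 <- y2[, drop=TRUE]

  RFobject <- randomForest(x, y, na.action=na.roughfix)
  p <- predict(RFobject, x2)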

In this case, the CSV contains 10 columns, of which columns 1-6 are numeric
predictors (day of week, week of month, etc.) and column 8 is the target
(sales, a numeric value).

randomForest does fine with the data; our issue is how long it takes.  In
this case, about 5,000 rows of data takes just a few seconds, but
going to 50,000 rows doesn't take 10x the time; it takes perhaps 30 or 40
minutes.

We've downloaded and tried RT-Rank, which is a multi-threaded version of
RandomForest, and this seems to produce the same (or slightly better)
predictions, but also gets done fairly quickly.

What can we do to improve the speed of this computation?  The system
we're on has dual quad-core Intel CPUs @ 2.33 GHz and 16 GB of RAM ...
we're using the stock R RPM for CentOS 5.5.

Thanks!

--
Anthony


Re: [R] randomForest speed improvements

2011-01-03 Thread Jonathan P Daily
Have you tried adjusting:
mtry - the number of variables randomly tried at each split
ntree - the number of trees grown
keep.forest - logical; whether to keep the forest in the returned object

Specifically, I found huge improvements in speed by switching keep.forest 
to FALSE in the past when I didn't actually need the forest post analysis.
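
A minimal sketch of setting these three arguments, assuming simulated data (x, y and the chosen values are hypothetical and not tuned):

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 1000 rows, 6 numeric predictors.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)

# Fewer trees, 2 candidate variables per split, and no stored forest.
rf <- randomForest(x, y, mtry = 2, ntree = 100, keep.forest = FALSE)

rf$ntree            # number of trees grown
rf$mtry             # variables tried at each split
is.null(rf$forest)  # TRUE: the trees themselves were not kept
```

Without the forest the object still carries the out-of-bag predictions in rf$predicted, but it cannot be passed to predict().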
--
Jonathan P. Daily
Technician - USGS Leetown Science Center
11649 Leetown Road
Kearneysville WV, 25430
(304) 724-4480
Is the room still a room when its empty? Does the room,
 the thing itself have purpose? Or do we, what's the word... imbue it.
 - Jubal Early, Firefly



Re: [R] randomForest speed improvements

2011-01-03 Thread apresley

I haven't tried changing mtry or ntree at all ... though I suppose with
only 6 variables and tens of thousands of rows, we can probably get by with
fewer than 500 trees (the default?).

Although tossing the forest does speed things up a bit (it seems to be about
15-20% faster in some cases), I need to keep the forest to do the prediction;
otherwise, predict() complains that there is no forest component in the object.
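
One possible way around this, sketched under assumptions: if the rows to predict are known at training time, they can be passed to randomForest() through its xtest argument, and the test-set predictions are then returned in the test component even when the forest is not kept. The data and object names below are simulated and hypothetical.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): training rows and held-out rows.
n <- 1000
x <- data.frame(matrix(runif(n * 6), ncol = 6))
y <- rowSums(x) + rnorm(n, sd = 0.1)
x2 <- data.frame(matrix(runif(200 * 6), ncol = 6))

# Hand the held-out predictors to randomForest() and drop the forest.
rf <- randomForest(x, y, xtest = x2, ntree = 100, keep.forest = FALSE)

# predict() fails because there is no forest component ...
err <- tryCatch(predict(rf, x2), error = function(e) e)

# ... but the predictions for x2 were already computed during training.
p <- rf$test$predicted
```

This only helps when the prediction set is available up front; to score data that arrives later, the forest has to be kept.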

--
Anthony