[R] RandomForests Limitations? Work Arounds?

2010-09-07 Thread Michael Lindgren
Greetings,

I want to inquire about the memory limitations of the randomForest package.
I am attempting to perform cluster analysis using RF, but I keep getting an
error that R cannot allocate a vector of a given size.  I am currently using
the 32-bit version of R to run this analysis; are there fewer memory issues
when using the 64-bit version of R?  Mainly, I want to be able to run RF on
a very large dataset, but I keep having to take very small samples to do so.
Any advice is more than appreciated.

Best,

Michael

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RandomForests Limitations? Work Arounds?

2010-09-07 Thread Liaw, Andy
You're not giving us much to go on, so the info I can give is
correspondingly vague.

I take it you are using RF in unsupervised mode.  What RF does in this
case is generate a second, synthetic dataset whose variables have the
same marginal distributions as your data but are mutually independent.
It then runs classification, treating your data as one class and the
synthetic data as the other.  The output is the proximity matrix, which
you can use as a similarity matrix for clustering.
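
For concreteness, here is a minimal sketch of that workflow.  (The use of
iris, the "average" linkage, and k = 3 are all placeholders so the example
runs; substitute your own data and choices.)

  library(randomForest)

  set.seed(1)
  x <- iris[, 1:4]                 # stand-in for your real data

  ## Omitting y runs randomForest in unsupervised mode;
  ## proximity = TRUE requests the n-by-n proximity matrix.
  urf <- randomForest(x = x, proximity = TRUE)

  ## Convert proximities to dissimilarities and cluster:
  d  <- as.dist(1 - urf$proximity)
  hc <- hclust(d, method = "average")
  groups <- cutree(hc, k = 3)
  table(groups)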

Given that, you know that RF basically has to use twice as much memory
to store the data.  That's one place where it can take lots of memory.
The second is the storage of the proximity matrix itself: if you have n
rows in your data, the proximity matrix is n by n.  For even moderately
large n, this is the part that will dominate memory use.
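
A quick back-of-the-envelope calculation makes the point (n = 50000 here
is just a hypothetical row count):

  n <- 50000          # hypothetical number of rows
  8 * n^2 / 1024^3    # doubles are 8 bytes each:
                      # ~18.6 GB for the proximity matrix alone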

Just in case you haven't seen/heard: avoid the formula interface (i.e.,
randomForest(~ ., data = mydata, ...)), because that can really soak up
memory.
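
In other words, pass the predictors directly.  A sketch (mydata is a
stand-in for your own data frame):

  library(randomForest)

  mydata <- iris[, 1:4]    # stand-in for your predictors

  ## Formula interface -- convenient, but can soak up memory:
  ## rf <- randomForest(~ ., data = mydata, proximity = TRUE)

  ## x interface -- the same fit, with a lighter memory footprint:
  rf <- randomForest(x = mydata, proximity = TRUE)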

Yes, a 64-bit OS and 64-bit R can help, but only if you have enough RAM
to take advantage of the platform.

Andy
