Re: [R] randomForest memory footprint

2011-09-08 Thread Liaw, Andy
It looks like you are building a regression model.  With such a large number of 
rows, you should try to limit the size of the trees by setting nodesize to 
something larger than the default (5).  The issue, I suspect, is that the 
largest possible tree has about 2*(number of rows)/nodesize nodes (with 500,000 
rows and the default nodesize, that is on the order of 200,000 nodes per tree), 
and each node takes a row in a matrix to store.  Multiply that by the number of 
trees you are trying to build, and you can see how the memory gets gobbled up 
quickly.  Boosted trees don't usually run into this problem because one usually 
boosts very small trees (typically no more than 10 terminal nodes per tree).
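
For example (just a sketch, and the nodesize value below is illustrative rather 
than a recommendation), a call along these lines keeps each tree, and therefore 
the preallocated per-tree node matrices, much smaller:

  library(randomForest)

  ## nodesize = 50 is only an illustrative value: a larger nodesize means
  ## shallower trees, so far fewer node rows have to be stored per tree.
  mydata1.rf.model2 <- randomForest(x = mydata1.clean[, 3:9],
                                    y = mydata1.clean[, 10],
                                    ntree = 100,
                                    nodesize = 50)

If you want a hard upper bound instead, the maxnodes argument caps the number 
of terminal nodes per tree directly.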

Best,
Andy 

 -----Original Message-----
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
 Sent: Wednesday, September 07, 2011 2:46 PM
 To: r-help@r-project.org
 Subject: [R] randomForest memory footprint
 
 Hello, I am attempting to train a random forest model using the
 randomForest package on 500,000 rows and 8 columns (7 predictors, 1
 response). The data set is the first block of the UCI Machine Learning
 Repository's "Record Linkage Comparison Patterns" data, with the slight
 modification that I dropped two columns with lots of NAs and used knn
 imputation to fill in the other gaps.
 
 When I load in my dataset, R uses no more than 100 MB of RAM. I'm
 running 64-bit R with ~4 GB of RAM available. When I execute the
 randomForest() function, however, I get memory errors. Example:
 
 > summary(mydata1.clean[,3:10])
            cmp_fname_c1  cmp_lname_c1  cmp_sex  cmp_bd  cmp_bm  cmp_by  cmp_plz  is_match
  Min.      0             0             0        0       0       0       0        FALSE:572820
  1st Qu.   0.2857        0.1000        1        0       0       0       0        TRUE :  2093
  Median    1             0.1818        1        0       0       0       0
  Mean      0.7127        0.3156        0.9551   0.2247  0.4886  0.2226  0.00549
  3rd Qu.   1             0.4286        1        0       1       0       0
  Max.      1             1             1        1       1       1       1
 > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9],
                                     y = mydata1.clean[,10], ntree = 100)
 Error: cannot allocate vector of size 877.2 Mb
 In addition: Warning messages:
 1: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 2: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 3: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 4: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 
 Other techniques such as boosted trees handle the data size just fine.
 Are there any parameters I can adjust such that I can use a value of
 100 or more for ntree?
 
 Thanks,
 John
 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] randomForest memory footprint

2011-09-08 Thread John Foreman
I set maxnodes and nodesize to reasonable levels and everything is
working great now. Thanks for the guidance.
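
For reference, a call of this general shape (the specific parameter values 
below are just illustrative placeholders, not the exact ones I used) looks 
like:

  library(randomForest)

  ## Cap tree size explicitly, then check how big the fitted forest ends up.
  mydata1.rf.model2 <- randomForest(x = mydata1.clean[, 3:9],
                                    y = mydata1.clean[, 10],
                                    ntree = 100,
                                    nodesize = 100,  # illustrative value
                                    maxnodes = 500)  # illustrative value
  format(object.size(mydata1.rf.model2), units = "Mb")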

v/r,
John


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.