I set maxnodes and nodesize to reasonable levels and everything is
working great now. Thanks for the guidance.
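For anyone who finds this thread later, the fix looked something like the
sketch below. The specific values are illustrative rather than the ones I
settled on; nodesize and maxnodes are both real randomForest() arguments.

    library(randomForest)
    ## Sketch only: a larger nodesize and a maxnodes cap keep each tree
    ## (and therefore the forest's memory footprint) small.
    mydata1.rf.model2 <- randomForest(x = mydata1.clean[, 3:9],
                                      y = mydata1.clean[, 10],
                                      ntree    = 100,
                                      nodesize = 50,   # default: 1 (classification), 5 (regression)
                                      maxnodes = 2000) # hard cap on nodes per tree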
v/r,
John
On Thu, Sep 8, 2011 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:
It looks like you are building a regression model. With such a large number
of rows, you should try to limit the size of the trees by setting nodesize to
something larger than the default (5). The issue, I suspect, is that the
largest possible tree has roughly 2 * nrow(x) / nodesize nodes, and each node
takes a row in a matrix to store. Multiply that by the number of trees you
are trying to build, and you can see how memory gets gobbled up quickly.
Boosted trees don't usually run into this problem because one usually boosts
very small trees (often no more than 10 terminal nodes per tree).
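To put rough numbers on it (a back-of-envelope sketch in R; the row count is
read off the summary() output quoted below):

    n        <- 572820 + 2093   # rows, from the summary() below
    nodesize <- 5               # the regression default
    ntree    <- 100
    nodes.per.tree <- 2 * n / nodesize        # ~230,000 nodes in the largest tree
    node.rows      <- nodes.per.tree * ntree  # ~23 million node rows over 100 trees
    node.rows * 8 / 2^20                      # ~175 Mb per matrix of doubles --
                                              # and the forest stores several such matrices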
Best,
Andy
-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
Sent: Wednesday, September 07, 2011 2:46 PM
To: r-help@r-project.org
Subject: [R] randomForest memory footprint
Hello, I am attempting to train a random forest model using the
randomForest package on 500,000 rows and 8 columns (7 predictors, 1
response). The data set is the first block of data from the UCI
Machine Learning Repository dataset "Record Linkage Comparison Patterns",
with the slight modification that I dropped two columns with many
NAs and used knn imputation to fill in the remaining gaps.
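The preparation looked something along these lines (a sketch: the file name,
the two dropped columns, and the use of DMwR::knnImputation are
reconstructions of what I described above, not my exact code):

    library(DMwR)                                 # provides knnImputation()
    raw <- read.csv("block_1.csv", na.strings = "?")
    raw$cmp_fname_c2 <- NULL                      # mostly NA, dropped
    raw$cmp_lname_c2 <- NULL                      # mostly NA, dropped
    num <- sapply(raw, is.numeric)                # impute numeric fields only
    raw[num] <- knnImputation(raw[num], k = 10)   # knn fill for remaining gaps
    mydata1.clean <- raw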
When I load in my dataset, R uses no more than 100 megs of RAM. I'm
running a 64-bit R with ~4 gigs of RAM available. When I execute the
randomForest() function, however, I get memory complaints. Example:
summary(mydata1.clean[,3:10])
  cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000
 Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000
 Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
 3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
     cmp_bm           cmp_by          cmp_plz          is_match
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
 Median :0.0000   Median :0.0000   Median :0.00000
 Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9],
                                  y = mydata1.clean[,10], ntree = 100)
Error: cannot allocate vector of size 877.2 Mb
In addition: Warning messages:
1: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
2: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
3: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
4: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
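The data frame itself is not the problem; a quick check (base R) confirms it
is small, so the allocation failure has to come from the forest that
randomForest() is growing:

    print(object.size(mydata1.clean), units = "Mb")  # the data: well under 100 Mb
    gc()                                             # what R is actually holding on to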
Other techniques, such as boosted trees, handle this data size just fine.
Are there any parameters I can adjust so that I can use a value of
100 or more for ntree?
Thanks,
John
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.