Re: [R] RandomForest, Party and Memory Management

2013-02-04 Thread Prof Brian Ripley

On Sun, 3 Feb 2013, Lorenzo Isella wrote:


Dear All,
For a data mining project, I am relying heavily on the randomForest and party
packages.
Due to the large size of the data set, I often run into memory problems (in
particular with the party package; randomForest seems to use less memory). I
really have two questions at this point.
1) Please see how I am using the party and randomForest packages below. Any
comment is welcome and useful.




library(party)  # cforest() and cforest_unbiased() come from party

myparty <- cforest(SalePrice ~ ModelID +
                     ProductGroup +
                     ProductGroupDesc + MfgYear + saledate3 + saleday +
                     salemonth,
                   data = trainRF,
                   control = cforest_unbiased(mtry = 3, ntree = 300, trace = TRUE))




library(randomForest)  # randomForest() comes from the randomForest package

rf_model <- randomForest(SalePrice ~ ModelID +
                           ProductGroup +
                           ProductGroupDesc + MfgYear + saledate3 + saleday +
                           salemonth,
                         data = trainRF, na.action = na.omit,
                         importance = TRUE, do.trace = 100, mtry = 3, ntree = 300)

2) I have another question: sometimes R crashes after telling me that it is
unable to allocate, e.g., an array of 1.5 GB.


Do not use the word 'crash': see the posting guide.  I suspect it 
gives you an error message.


However, I have 4 GB of RAM on my box, so technically the memory is there,
but is there a way to enable R to use more of it?


Yes.  I am surmising this is Windows, but you have not told us so.
See the rw-FAQ.  The real answer is to run a 64-bit OS: your computer
may have 4GB of RAM, but a 32-bit OS gives each process only a 2GB
address space, which could be raised to 3GB.
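
A minimal sketch (assuming a 32-bit R build on Windows, which the post
does not confirm) of how to check which build is in use and what the
current per-process allocation limit is:

sessionInfo()              # platform string shows i386 (32-bit) vs x86_64 (64-bit)
.Machine$sizeof.pointer    # 4 = 32-bit R, 8 = 64-bit R
memory.limit()             # Windows-only: current allocation limit in MB
memory.limit(size = 3000)  # try to raise the limit towards 3 GB (see the rw-FAQ)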




Many thanks

Lorenzo



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RandomForest, Party and Memory Management

2013-02-04 Thread Lorenzo Isella
Dear Dennis and dear All,
It was probably not my best post.
I am running R on a Debian box (amd64 architecture), which is why I
was surprised to see memory issues when dealing with a vector larger
than 1 GB. The memory is there, but probably it is not contiguous.
I will look into the matter and post again (generating an
artificial data frame if needed).
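
For reference, a rough sketch (with hypothetical column types, guessed
only from the variable names in my original call) of what such an
artificial data frame could look like:

set.seed(1)
n <- 1e5   # adjust upwards to reproduce the memory pressure
trainRF <- data.frame(
  SalePrice        = rlnorm(n, meanlog = 10),
  ModelID          = factor(sample(1:500, n, replace = TRUE)),
  ProductGroup     = factor(sample(LETTERS[1:6], n, replace = TRUE)),
  ProductGroupDesc = factor(sample(letters[1:6], n, replace = TRUE)),
  MfgYear          = sample(1990:2012, n, replace = TRUE),
  saledate3        = runif(n),   # numeric stand-in for a date variable
  saleday          = sample(1:31, n, replace = TRUE),
  salemonth        = factor(sample(month.abb, n, replace = TRUE))
)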
Many thanks

Lorenzo

On 4 February 2013 00:50, Dennis Murphy djmu...@gmail.com wrote:
 Hi Lorenzo:

 On Sun, Feb 3, 2013 at 11:47 AM, Lorenzo Isella
 lorenzo.ise...@gmail.com wrote:
 Dear All,
 For a data mining project, I am relying heavily on the randomForest and
 party packages.
 Due to the large size of the data set, I often run into memory problems (in
 particular with the party package; randomForest seems to use less memory). I
 really have two questions at this point.
 1) Please see how I am using the party and randomForest packages below. Any
 comment is welcome and useful.

 As noted elsewhere, the example is not reproducible so I can't help you there.



 myparty <- cforest(SalePrice ~ ModelID +
                      ProductGroup +
                      ProductGroupDesc + MfgYear + saledate3 + saleday +
                      salemonth,
                    data = trainRF,
                    control = cforest_unbiased(mtry = 3, ntree = 300, trace = TRUE))




 rf_model <- randomForest(SalePrice ~ ModelID +
                            ProductGroup +
                            ProductGroupDesc + MfgYear + saledate3 + saleday +
                            salemonth,
                          data = trainRF, na.action = na.omit,
                          importance = TRUE, do.trace = 100, mtry = 3, ntree = 300)

 2) I have another question: sometimes R crashes after telling me that it is
 unable to allocate, e.g., an array of 1.5 GB.
 However, I have 4 GB of RAM on my box, so technically the memory is there,
 but is there a way to enable R to use more of it?

 4 GB is not a lot of RAM for data mining projects. I have twice that
 and run into memory limits on some fairly simple tasks (e.g., 2D
 tables) in large simulations with 1M or 10M runs. Part of the problem
 is that data is often copied, sometimes more than once. If you have a
 1 GB input data frame, three copies and you're out of space. Moreover,
 copied objects need contiguous memory, and this becomes very difficult
 to achieve with large objects and limited RAM. With 4 GB of RAM, you need
 to be more clever (a small sketch of trimming the data itself follows
 this list):

 * eliminate as many other processes that access RAM as possible (e.g.,
 no active browser)
 * think of ways to process your data in chunks (which is harder to do
 when the objective is model fitting)
 * type ?"Memory-limits" (including the quotes) at the console for
 explanations about memory limits and a few places to look for
 potential solutions
 * look into 'big data' packages like ff or bigmemory, among others
 * if you're in an (American ?) academic institution, you can get a
 free license for Revolution R, which is supposed to be better for big
 data problems than vanilla R
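
 As a small sketch (not something from your post, and assuming trainRF
 carries many more columns than the formula uses), trimming the data
 frame to just the model variables keeps every subsequent copy as small
 as possible:

 vars <- c("SalePrice", "ModelID", "ProductGroup", "ProductGroupDesc",
           "MfgYear", "saledate3", "saleday", "salemonth")
 trainRF <- trainRF[, vars]   # drop unused columns before any model fitting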

 It's hard to be specific about potential solutions, but the above
 should broaden your perspective on the big data problem and possible
 avenues for solving it.
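
 It also helps to measure before fitting. A base-R sketch (again, only
 an illustration) of checking how much memory the main objects take:

 print(object.size(trainRF), units = "Mb")    # size of the training data
 sapply(trainRF, function(x) format(object.size(x), units = "Mb"))  # per column
 rm(list = setdiff(ls(), "trainRF"))          # keep only the data
 gc()                                         # collect garbage and report memory in use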

 Dennis

 Many thanks

 Lorenzo


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RandomForest, Party and Memory Management

2013-02-03 Thread Jeff Newmiller
Neither of your questions meets the Posting Guidelines (see footer of any 
email).
1) Not reproducible. [1]
2) Very operating-system specific and a FAQ. You have not indicated what your 
OS is (via sessionInfo), nor what reading you have done to address memory 
problems already (use a search engine... or begin with the FAQs in R help or on 
CRAN).

[1] 
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:jdnew...@dcn.davis.ca.us          Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--- 
Sent from my phone. Please excuse my brevity.

Lorenzo Isella lorenzo.ise...@gmail.com wrote:

Dear All,
For a data mining project, I am relying heavily on the randomForest and
party packages.
Due to the large size of the data set, I often run into memory problems (in
particular with the party package; randomForest seems to use less memory).
I really have two questions at this point.
1) Please see how I am using the party and randomForest packages below. Any
comment is welcome and useful.


myparty <- cforest(SalePrice ~ ModelID +
                     ProductGroup +
                     ProductGroupDesc + MfgYear + saledate3 + saleday +
                     salemonth,
                   data = trainRF,
                   control = cforest_unbiased(mtry = 3, ntree = 300, trace = TRUE))


rf_model <- randomForest(SalePrice ~ ModelID +
                           ProductGroup +
                           ProductGroupDesc + MfgYear + saledate3 + saleday +
                           salemonth,
                         data = trainRF, na.action = na.omit,
                         importance = TRUE, do.trace = 100, mtry = 3, ntree = 300)

2) I have another question: sometimes R crashes after telling me that it is
unable to allocate, e.g., an array of 1.5 GB.
However, I have 4 GB of RAM on my box, so technically the memory is there,
but is there a way to enable R to use more of it?

Many thanks

Lorenzo


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.