Re: [R] RandomForest, Party and Memory Management
On Sun, 3 Feb 2013, Lorenzo Isella wrote:

> Dear All,
> For a data mining project, I am relying heavily on the RandomForest and
> Party packages. Due to the large size of the data set, I often run into
> memory problems (in particular with the Party package; RandomForest seems
> to use less memory). I really have two questions at this point.
>
> 1) Please see how I am using the Party and RandomForest packages. Any
> comment is welcome and useful.
>
> myparty <- cforest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
>                      MfgYear + saledate3 + saleday + salemonth,
>                    data = trainRF,
>                    control = cforest_unbiased(mtry = 3, ntree = 300,
>                                               trace = TRUE))
>
> rf_model <- randomForest(SalePrice ~ ModelID + ProductGroup +
>                            ProductGroupDesc + MfgYear + saledate3 +
>                            saleday + salemonth,
>                          data = trainRF, na.action = na.omit,
>                          importance = TRUE, do.trace = 100,
>                          mtry = 3, ntree = 300)
>
> 2) I have another question: sometimes R crashes after telling me that it
> is unable to allocate e.g. an array of 1.5 Gb.

Do not use the word 'crash': see the posting guide. I suspect it gives you
an error message.

> However, I have 4 Gb of RAM on my box, so... technically the memory is
> there, but is there a way to enable R to use more of it?

Yes. I am surmising this is Windows, but you have not told us so. See the
rw-FAQ. The real answer is to run a 64-bit OS: your computer may have 4 GB
of RAM, but your OS gives each process only a 2 GB address space, which
could be raised to 3 GB.

> Many thanks
> Lorenzo

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
Fax: +44 1865 272595
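A hedged sketch of how one might follow up on this advice from an R session.
The memory.limit() calls are Windows-only, the 3000 Mb figure is illustrative,
and none of this applies to a 64-bit build of R:

    ## Report the platform and whether R is a 32- or 64-bit build
    sessionInfo()

    ## Windows only: current per-process memory ceiling, in Mb
    memory.limit()

    ## Windows only: request up to ~3 GB; only effective if the OS actually
    ## grants a 3 GB user address space (see the rw-FAQ)
    memory.limit(size = 3000)

    ## Report current memory use and run a garbage collection
    gc()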
Re: [R] RandomForest, Party and Memory Management
Dear Dennis and dear All,
It was probably not my best post. I am running R on a Debian box (amd64
architecture), which is why I was surprised to see memory issues when
dealing with a vector larger than 1 Gb. The memory is there, but probably
it is not contiguous. I will look into the matter and post again
(generating an artificial data frame if needed).
Many thanks
Lorenzo

On 4 February 2013 00:50, Dennis Murphy djmu...@gmail.com wrote:

> Hi Lorenzo:
>
> On Sun, Feb 3, 2013 at 11:47 AM, Lorenzo Isella
> lorenzo.ise...@gmail.com wrote:
>> Dear All,
>> For a data mining project, I am relying heavily on the RandomForest and
>> Party packages. Due to the large size of the data set, I often run into
>> memory problems (in particular with the Party package; RandomForest
>> seems to use less memory). I really have two questions at this point.
>>
>> 1) Please see how I am using the Party and RandomForest packages. Any
>> comment is welcome and useful.
>
> As noted elsewhere, the example is not reproducible, so I can't help you
> there.
>
>> myparty <- cforest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
>>                      MfgYear + saledate3 + saleday + salemonth,
>>                    data = trainRF,
>>                    control = cforest_unbiased(mtry = 3, ntree = 300,
>>                                               trace = TRUE))
>>
>> rf_model <- randomForest(SalePrice ~ ModelID + ProductGroup +
>>                            ProductGroupDesc + MfgYear + saledate3 +
>>                            saleday + salemonth,
>>                          data = trainRF, na.action = na.omit,
>>                          importance = TRUE, do.trace = 100,
>>                          mtry = 3, ntree = 300)
>>
>> 2) I have another question: sometimes R crashes after telling me that it
>> is unable to allocate e.g. an array of 1.5 Gb. However, I have 4 Gb of
>> RAM on my box, so... technically the memory is there, but is there a way
>> to enable R to use more of it?
>
> 4 Gb is not a lot of RAM for data mining projects. I have twice that and
> run into memory limits on some fairly simple tasks (e.g., 2D tables) in
> large simulations with 1M or 10M runs. Part of the problem is that data
> is often copied, sometimes more than once. If you have a 1 Gb input data
> frame, three copies and you're out of space. Moreover, copied objects
> need contiguous memory, and this becomes very difficult to achieve with
> large objects and limited RAM. With 4 Gb RAM, you need to be more clever:
>
> * eliminate as many other processes that use RAM as possible (e.g., no
>   active browser)
> * think of ways to process your data in chunks (which is harder to do
>   when the objective is model fitting)
> * type ?"Memory-limits" (including the quotes) at the console for
>   explanations of memory limits and a few places to look for potential
>   solutions
> * look into 'big data' packages such as ff or bigmemory, among others
> * if you're in an (American?) academic institution, you can get a free
>   license for Revolution R, which is supposed to be better for big data
>   problems than vanilla R
>
> It's hard to be specific about potential solutions, but the above should
> broaden your perspective on the big data problem and possible avenues for
> solving it.
>
> Dennis
>
>> Many thanks
>> Lorenzo
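To make the "process your data in chunks" suggestion above concrete, here is
a small sketch using only base R. The file name, chunk size and the presence
of a SalePrice column are assumptions for illustration, not details from the
thread:

    ## Read a large CSV a block of rows at a time, keeping only small summaries
    con <- file("big_sales_data.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line

    chunk_size <- 100000
    price_sum <- 0
    n_rows <- 0
    repeat {
      chunk <- try(read.csv(con, header = FALSE, nrows = chunk_size,
                            col.names = col_names), silent = TRUE)
      if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
      ## keep only small summaries, never the full data set
      price_sum <- price_sum + sum(chunk$SalePrice, na.rm = TRUE)
      n_rows <- n_rows + nrow(chunk)
    }
    close(con)
    price_sum / n_rows  # mean price without holding the whole file in RAM

The ff and bigmemory packages mentioned above push the same idea further by
keeping the data on disk (for example, ff::read.csv.ffdf() builds a
data-frame-like object backed by files), at the price of working with a more
restricted set of modelling functions.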
Re: [R] RandomForest, Party and Memory Management
Neither of your questions meets the Posting Guidelines (see the footer of any
email).

1) Not reproducible. [1]
2) Very operating-system specific and a FAQ. You have not indicated what
   your OS is (via sessionInfo()), nor what reading you have already done to
   address memory problems (use a search engine... or begin with the FAQs in
   R help or on CRAN).

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

---
Jeff Newmiller, jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
Sent from my phone. Please excuse my brevity.

Lorenzo Isella lorenzo.ise...@gmail.com wrote:

> Dear All,
> For a data mining project, I am relying heavily on the RandomForest and
> Party packages. Due to the large size of the data set, I often run into
> memory problems (in particular with the Party package; RandomForest seems
> to use less memory). I really have two questions at this point.
>
> 1) Please see how I am using the Party and RandomForest packages. Any
> comment is welcome and useful.
>
> myparty <- cforest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
>                      MfgYear + saledate3 + saleday + salemonth,
>                    data = trainRF,
>                    control = cforest_unbiased(mtry = 3, ntree = 300,
>                                               trace = TRUE))
>
> rf_model <- randomForest(SalePrice ~ ModelID + ProductGroup +
>                            ProductGroupDesc + MfgYear + saledate3 +
>                            saleday + salemonth,
>                          data = trainRF, na.action = na.omit,
>                          importance = TRUE, do.trace = 100,
>                          mtry = 3, ntree = 300)
>
> 2) I have another question: sometimes R crashes after telling me that it
> is unable to allocate e.g. an array of 1.5 Gb. However, I have 4 Gb of
> RAM on my box, so... technically the memory is there, but is there a way
> to enable R to use more of it?
>
> Many thanks
> Lorenzo
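A sketch of the kind of minimal, self-contained example being asked for here.
The variable names mirror the original post, but the simulated data, sizes
and distributions are invented purely for illustration:

    ## Simulated stand-in for the (unavailable) trainRF data frame
    library(randomForest)
    library(party)

    set.seed(1)
    n <- 1000  # deliberately small so the example runs in seconds
    trainRF <- data.frame(
      SalePrice        = rlnorm(n, meanlog = 10),
      ModelID          = factor(sample(1:20, n, replace = TRUE)),
      ProductGroup     = factor(sample(LETTERS[1:5], n, replace = TRUE)),
      ProductGroupDesc = factor(sample(paste0("desc", 1:5), n, replace = TRUE)),
      MfgYear          = sample(1990:2010, n, replace = TRUE),
      saledate3        = sample(1:365, n, replace = TRUE),
      saleday          = sample(1:31, n, replace = TRUE),
      salemonth        = factor(sample(month.abb, n, replace = TRUE))
    )

    ## The two calls from the original post, scaled down (ntree = 50)
    rf_model <- randomForest(SalePrice ~ ModelID + ProductGroup +
                               ProductGroupDesc + MfgYear + saledate3 +
                               saleday + salemonth,
                             data = trainRF, na.action = na.omit,
                             importance = TRUE, mtry = 3, ntree = 50)

    myparty <- cforest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
                         MfgYear + saledate3 + saleday + salemonth,
                       data = trainRF,
                       control = cforest_unbiased(mtry = 3, ntree = 50))

    ## Include this output in the post so helpers know the OS and R version
    sessionInfo()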