Re: [R] Large dataset + randomForest
Thanks Max,

It looks as though that did actually solve the problem! It is a bit of a mystery to me, because I think the 151x150 matrix can't be that big, unless its elements are in turn huge data structures(?). I am now calling randomForest() like this:

rf <- randomForest(x=df[trainindices,-1], y=df[trainindices,1],
                   xtest=df[testindices,-1], ytest=df[testindices,1],
                   do.trace=5, ntree=500)

and it seems to be working just fine.

Thanks to all for your help,

Florian

On 26 Jul 2007, at 19:26, Kuhn, Max wrote:

> Florian,
>
> The first thing that you should change is how you call randomForest.
> Instead of specifying the model via a formula, use the
> randomForest(x, y) interface.
>
> When a formula is used, a terms object is created so that a model
> matrix can be created for these and future observations. That terms
> object can get big (I think it would be a matrix of size 151 x 150)
> and is diagonal.
>
> That might not solve it, but it should help.
>
> Max
>
> [...]
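For reference, a minimal sketch of the randomForest(x, y) interface Max recommends, assuming (as the thread does) that the response is in column 1 of df; the data.matrix() conversion is an assumption, not part of the thread, and only makes sense if all predictors are numeric:

library(randomForest)

## Sketch only: avoids the formula interface and its terms object.
## df and trainindices are as in the thread; numeric predictors assumed.
x <- data.matrix(df[trainindices, -1])   # predictors as a plain matrix
y <- df[trainindices, 1]                 # response vector
rf <- randomForest(x = x, y = y, ntree = 500, do.trace = 5)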
Re: [R] Large dataset + randomForest
I compiled the newest R version on Red Hat Linux (uname -a = Linux .cam.ac.uk 2.4.21-50.ELsmp #1 SMP Tue May 8 17:18:29 EDT 2007 i686 i686 i386 GNU/Linux) with 4GB of physical memory.

The step where the whole script crashes is within the randomForest() routine; I know that because I want to time it, so I have it inside a system.time() call. This function exits with the error I posted earlier:

> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb

When calling gc() directly before I call randomForest() and after, I get this:

> gc()
            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells    255416   6.9     899071  24.1    16800.0    818163   21.9
Vcells  17874469 136.4   90854072 693.2     4000.1 269266598 2054.4
> rf <- randomForest(V1 ~ ., data=df, subset=trainindices, do.trace=5)
Error: cannot allocate vector of size 626.1 Mb
> gc()
            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells    255441   6.9     899071  24.1    16800.0    818163   21.9
Vcells  17874541 136.4  112037674 854.8     4000.1 269266598 2054.4

So the only real difference is in the "gc trigger" column and the "(Mb)" column next to it. By the way, I am not running it in GUI mode.

On 27 Jul 2007, at 13:17, jim holtman wrote:

> At the max, you had 2GB of memory being used. What operating system
> are you running on and how much physical memory do you have on your
> system? For Windows, there are parameters on the command line to
> start RGUI that let you define how much memory can be used. I am not
> sure about Linux/UNIX. So you are probably hitting the 2GB max and
> then you don't have any more physical memory available. If the
> computation is a long script, you might put some 'gc()' statements in
> the code to see what section is using the most memory.
>
> Your problem might have to be broken into parts to run.
>
> On 7/27/07, Florian Nigsch <[EMAIL PROTECTED]> wrote:
>> Hi Jim,
>>
>> Here is the output of gc() from the same session of R (that I still
>> have running...)
>>
>>> gc()
>>             used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
>> Ncells    255416   6.9     899071  24.1    16800.0    818163   21.9
>> Vcells  17874469 136.4  113567591 866.5     4000.1 269266598 2054.4
>>
>> By increasing the limit of vcells and ncells to 1GB (if that is
>> possible?!), would that perhaps solve my problem?
>>
>> Cheers,
>>
>> Florian
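Jim's gc() suggestion, reduced to a minimal sketch; the step functions are hypothetical placeholders for the script's own stages:

gc(reset = TRUE)        # reset the "max used" statistics
dat <- load.data()      # hypothetical step 1
print(gc())             # "max used" now reflects step 1
rf  <- fit.forest(dat)  # hypothetical step 2
print(gc())             # growth here is attributable to step 2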
Re: [R] Large dataset + randomForest
Florian,

The first thing that you should change is how you call randomForest. Instead of specifying the model via a formula, use the randomForest(x, y) interface.

When a formula is used, a terms object is created so that a model matrix can be created for these and future observations. That terms object can get big (I think it would be a matrix of size 151 x 150) and is diagonal.

That might not solve it, but it should help.

Max

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Florian Nigsch
Sent: Thursday, July 26, 2007 2:07 PM
To: r-help@stat.math.ethz.ch
Subject: [R] Large dataset + randomForest

[Please CC me in any replies as I am not currently subscribed to the list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did not come to a definite conclusion. What I am trying to do is the following: I want to read in a dataset with approx. 100,000 rows and approx. 150 columns. The file size is ~33MB, which one would deem not too big a file for R. To speed up the reading in of the file, I do not use read.table but a loop that reads with scan() into a buffer, does some preprocessing, and then adds the data to a data frame.

When I then want to run randomForest(), R complains that it cannot allocate a vector of size 313.0 MB. I am aware that randomForest needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I acknowledge the need for some internal data in R, but 10 times seems a bit too much), and
2) there is still physical memory free on the machine (4GB in total, even though R is limited to 2GB if I remember the help pages correctly - still, 2GB should be enough!). It doesn't seem to work either with changed settings done via mem.limits() or the run-time arguments --min-vsize and --max-vsize - what do these have to be set to in my case?

> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
> object.size(df)/1024/1024
[1] 129.5390

Any help would be greatly appreciated,

Florian

--
Florian Nigsch <[EMAIL PROTECTED]>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073
[R] Large dataset + randomForest
[Please CC me in any replies as I am not currently subscribed to the list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did not come to a definite conclusion. What I am trying to do is the following: I want to read in a dataset with approx. 100,000 rows and approx. 150 columns. The file size is ~33MB, which one would deem not too big a file for R. To speed up the reading in of the file, I do not use read.table but a loop that reads with scan() into a buffer, does some preprocessing, and then adds the data to a data frame.

When I then want to run randomForest(), R complains that it cannot allocate a vector of size 313.0 MB. I am aware that randomForest needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I acknowledge the need for some internal data in R, but 10 times seems a bit too much), and
2) there is still physical memory free on the machine (4GB in total, even though R is limited to 2GB if I remember the help pages correctly - still, 2GB should be enough!). It doesn't seem to work either with changed settings done via mem.limits() or the run-time arguments --min-vsize and --max-vsize - what do these have to be set to in my case?

> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
> object.size(df)/1024/1024
[1] 129.5390

Any help would be greatly appreciated,

Florian

--
Florian Nigsch <[EMAIL PROTECTED]>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073
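The scan()-in-a-loop import described above, as a minimal sketch; the file name, chunk size, and the 150-column numeric layout are assumptions:

con <- file("data.csv", open = "r")   # hypothetical file name
chunks <- list()
repeat {
  buf <- scan(con, what = numeric(), sep = ",", nlines = 10000, quiet = TRUE)
  if (length(buf) == 0) break
  ## any per-chunk preprocessing would go here
  chunks[[length(chunks) + 1]] <- matrix(buf, ncol = 150, byrow = TRUE)
}
close(con)
df <- as.data.frame(do.call(rbind, chunks))

Collecting the chunks in a list and calling rbind() once at the end avoids the repeated copying that growing a data frame inside the loop would cause.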
Re: [R] large dataset!
Jennifer,

we had a little discussion about this topic last May when I had a similar problem. It is archived at

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/76401.html

You can follow the thread to see the various arguments and solutions. I tried to summarize the suggested approaches at

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/76583.html

HTH,

Rogerio Porto.

--- Original header ---
From: [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Date: Sun, 2 Jul 2006 10:12:25 -0400 (EDT)
Subject: [R] large dataset!

> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables, and I was planning on using R to do this, but I
> can't even seem to read in the data. It just freezes and ties up the
> whole system -- and this is on a Linux box purchased about 6 months
> ago, a dual-processor PC that was pretty much the top of the line.
> I've tried expanding R's memory limits but it doesn't help. I'll be
> hugely disappointed if I can't use R b/c I need to build tailor-made
> models (multilevel and other complexities). My fall-back is the SPlus
> big data package, but I'd rather avoid it if anyone can provide a
> solution.
>
> Thanks
>
> Jennifer Hill
Re: [R] large dataset!
JENNIFER HILL columbia.edu> writes:

> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables, and I was planning on using R to do this, but I
> can't even seem to read in the data. It just freezes and ties up the
> whole system -- and this is on a Linux box purchased about 6 months
> ago, a dual-processor PC that was pretty much the top of the line.
> I've tried expanding R's memory limits but it doesn't help. I'll be
> hugely disappointed if I can't use R b/c I need to build tailor-made
> models (multilevel and other complexities). My fall-back is the SPlus
> big data package, but I'd rather avoid it if anyone can provide a
> solution.
>
> Thanks
>
> Jennifer Hill

Dear Jennifer,

you may want to look at the R newsletters: a few years ago one carried an article on using a DBMS with R, such as MySQL, Oracle, etc. This is a frequently asked question, and there are also some posts from the past few years that may be helpful. I have successfully read a large database into MySQL and accessed it from R; it was larger than your database.

I hope that helps.

Anupam Tyagi.
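The DBMS route Anupam describes, as a minimal RMySQL sketch; the database name, table name, and chunk size are hypothetical, and the point is that only one chunk at a time ever sits in R's memory:

library(RMySQL)
con <- dbConnect(MySQL(), dbname = "mydb")         # hypothetical database
res <- dbSendQuery(con, "SELECT * FROM bigtable")  # hypothetical table
repeat {
  chunk <- fetch(res, n = 100000)   # pull 100,000 rows at a time
  if (nrow(chunk) == 0) break
  ## aggregate or model each chunk here instead of holding everything
}
dbClearResult(res)
dbDisconnect(con)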
Re: [R] large dataset!
Hello Jennifer,

I'm writing a package, SQLiteDF, for Google SoC 2006 under the supervision of Prof. Bates & Prof. Riley. Basically, it stores data frames in SQLite databases (i.e., in a file) and aims to be transparently accessible from R using the same operators as for ordinary data frames. Right now it's quite usable (the "indexers" are working, and some other generic methods), but only on Linux (I should have the Windows package any time soon, though). I would love to hear about your requirements so as to test my package.

Cheers,

M. Manese

On 7/3/06, Andrew Robinson <[EMAIL PROTECTED]> wrote:
> Jennifer,
> [...]
Re: [R] large dataset!
Jennifer,

it sounds like that's too much data for R to hold in your computer's RAM. You should give serious consideration to whether you need all those data for the models that you're fitting, and if so, whether you need to fit them all at once. If not, think about pre-processing steps, using e.g. SQL commands, to pull out only the data that you need. For example, if the data are spatial, think about analyzing them by patches.

Good luck,

Andrew

On Sun, Jul 02, 2006 at 10:12:25AM -0400, JENNIFER HILL wrote:
> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables, and I was planning on using R to do this, but I
> can't even seem to read in the data. It just freezes and ties up the
> whole system -- and this is on a Linux box purchased about 6 months
> ago, a dual-processor PC that was pretty much the top of the line.
> I've tried expanding R's memory limits but it doesn't help. I'll be
> hugely disappointed if I can't use R b/c I need to build tailor-made
> models (multilevel and other complexities). My fall-back is the SPlus
> big data package, but I'd rather avoid it if anyone can provide a
> solution.
>
> Thanks
>
> Jennifer Hill

--
Andrew Robinson
Department of Mathematics and Statistics     Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia  Fax: +61-3-8344-4599
Email: [EMAIL PROTECTED]   http://www.ms.unimelb.edu.au
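Andrew's suggestion of pushing the selection into SQL, as a minimal RODBC sketch; the DSN, table, and column names are all hypothetical:

library(RODBC)
ch  <- odbcConnect("mydsn")   # hypothetical ODBC data source
sub <- sqlQuery(ch, "SELECT y, x1, x2 FROM survey WHERE region = 'NE'")
odbcClose(ch)
fit <- lm(y ~ x1 + x2, data = sub)   # model the patch, not all 3.5M rows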
[R] large dataset!
Hi, I need to analyze data that has 3.5 million observations and about 60 variables, and I was planning on using R to do this, but I can't even seem to read in the data. It just freezes and ties up the whole system -- and this is on a Linux box purchased about 6 months ago, a dual-processor PC that was pretty much the top of the line. I've tried expanding R's memory limits but it doesn't help. I'll be hugely disappointed if I can't use R b/c I need to build tailor-made models (multilevel and other complexities). My fall-back is the SPlus big data package, but I'd rather avoid it if anyone can provide a solution.

Thanks

Jennifer Hill
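A rough back-of-envelope check (not in the original post) of why this freezes: 3.5 million rows by 60 columns of 8-byte doubles already needs about 1.6 GB as a single data frame, before R makes any working copies.

> 3.5e6 * 60 * 8 / 2^30   # size in GiB if all columns are doubles
[1] 1.564622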
Re: [R] large dataset import, aggregation and reshape
Christoph Lehmann wrote:

> Dear useRs,
>
> We have a data set (comma delimited) with 12 million rows and 5
> columns (in fact many more, but we need only 4 of them): id, factor
> 'a' (5 levels), factor 'b' (15 levels), date-stamp, numeric
> measurement. We run R on SuSE Linux 9.1 with 2GB RAM (and a 3.5GB
> swap file). On average we have 30 obs. per id.
>
> We want to aggregate (e.g. the sum of the measurements under each
> factor level of 'a', and the same for factor 'b') and reshape the
> data so that for each id we have only one row in the final data
> frame; that means we finally have roughly 400,000 lines.
>
> I tried read.delim, used the nrows argument, defined colClasses (with
> an as.Date class) - memory problems at the latest when calling
> reshape and aggregate. Importing the date column as character and
> then converting it using 'as.Date' didn't succeed either.
>
> It seems the problematic, memory-intensive parts are:
> a) importing the huge data per se (but the data, with dim c(12e6, 5),
> should be << 2GB?)
> b) converting the time-stamp to a 'Date' class
> c) the aggregate and reshape task
>
> What are the steps you would recommend?
> (i) using scan instead of read.delim (with or without colClasses?)
> (ii) importing blocks of data (e.g. 1 million lines at a time),
> aggregating them, importing the next block, and so on?
> (iii) putting the data into a MySQL database, importing from there,
> and doing the reshape and aggregation in R for both factors
> separately
>
> thanks for hints from your valuable experience
>
> cheers
> christoph

I would try the latter and use an SQL interface such as RODBC or RMySQL. You can send your aggregation and reshape commands to the external database as an SQL query. An example with a database I have at hand: the table "datemesu" has 640,000 rows and 5 columns, the field "mesure" being a factor with 2 levels, "N" and "P".

> library(RODBC)
> fil <- "C:/Archives/Baobab/Baobab2000.mdb"
> chann <- odbcConnectAccess(fil)
> quer <- paste("SELECT numani, SUM(IIF(mesure = 'P', 1, 0)) AS wt,",
+               "SUM(IIF(mesure = 'N', 1, 0)) AS bcs,",
+               "MIN(date) AS minDate",
+               "FROM datemesu",
+               "GROUP BY numani")
> system.time(tab <- sqlQuery(chann, quer), gcFirst = TRUE)
[1] 11.16  0.19 11.54    NA    NA
> odbcCloseAll()

> dim(tab)
[1] 69360     4
> head(tab)
   numani wt bcs    minDate
1 SNFLCA1  1   0 1987-01-23
2 SNFLCA2  2   0 1987-01-10
3 SNFLCA4  1   0 1987-01-10
4 SNFLCA6  4   0 1987-02-02
5 SNFLCA7  4   0 1987-02-18
6 SNFLCA8  3   0 1987-01-09

Best,

Renaud

--
Dr Renaud Lancelot, veterinarian
C/O Ambassade de France - SCAC
BP 834 Antananarivo 101 - Madagascar
e-mail: [EMAIL PROTECTED]
tel.: +261 32 40 165 53 (cell)
      +261 20 22 665 36 ext. 225 (work)
      +261 20 22 494 37 (home)
[R] large dataset import, aggregation and reshape
Dear useRs,

We have a data set (comma delimited) with 12 million rows and 5 columns (in fact many more, but we need only 4 of them): id, factor 'a' (5 levels), factor 'b' (15 levels), date-stamp, numeric measurement. We run R on SuSE Linux 9.1 with 2GB RAM (and a 3.5GB swap file). On average we have 30 obs. per id.

We want to aggregate (e.g. the sum of the measurements under each factor level of 'a', and the same for factor 'b') and reshape the data so that for each id we have only one row in the final data frame; that means we finally have roughly 400,000 lines.

I tried read.delim, used the nrows argument, defined colClasses (with an as.Date class) - memory problems at the latest when calling reshape and aggregate. Importing the date column as character and then converting it using 'as.Date' didn't succeed either.

It seems the problematic, memory-intensive parts are:
a) importing the huge data per se (but the data, with dim c(12e6, 5), should be << 2GB?)
b) converting the time-stamp to a 'Date' class
c) the aggregate and reshape task

What are the steps you would recommend?
(i) using scan instead of read.delim (with or without colClasses?)
(ii) importing blocks of data (e.g. 1 million lines at a time), aggregating them, importing the next block, and so on?
(iii) putting the data into a MySQL database, importing from there, and doing the reshape and aggregation in R for both factors separately

Thanks for hints from your valuable experience.

Cheers,
christoph
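Option (ii) above, as a minimal sketch: read one million lines per block from an open connection, aggregate each block with rowsum(), and merge the partial sums at the end. The file name and the five-column layout are assumptions, and the sum shown is only the per-id total of the measurement column:

con <- file("data.csv", open = "r")   # hypothetical file name
parts <- list()
repeat {
  blk <- tryCatch(read.delim(con, header = FALSE, sep = ",", nrows = 1e6,
                             colClasses = c("character", "factor", "factor",
                                            "character", "numeric")),
                  error = function(e) NULL)  # reading past EOF on a connection errors
  if (is.null(blk) || nrow(blk) == 0) break
  parts[[length(parts) + 1]] <- rowsum(blk[[5]], group = blk[[1]])  # sum by id
}
close(con)
all   <- do.call(rbind, parts)
total <- rowsum(as.vector(all), group = rownames(all))  # ids split across blocks merge here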