Re: [R] Large dataset + randomForest

2007-07-27 Thread Florian Nigsch
Thanks Max,

It looks as though that did actually solve the problem! It is a bit
of a mystery to me, because the 151x150 matrix itself can't be that
big, unless its elements are in turn huge data structures. (?)

I am now calling randomForest() like this:
rf <- randomForest(x=df[trainindices,-1], y=df[trainindices,1],
                   xtest=df[testindices,-1], ytest=df[testindices,1],
                   do.trace=5, ntree=500)
and it seems to be working just fine.

Thanks to all for your help,

Florian

On 26 Jul 2007, at 19:26, Kuhn, Max wrote:

> Florian,
>
> The first thing that you should change is how you call randomForest.
> Instead of specifying the model via a formula, use the randomForest(x,
> y) interface.
>
> When a formula is used, there is a terms object created so that a  
> model
> matrix can be created for these and future observations. That terms
> object can get big (I think it would be a matrix of size 151 x 150)  
> and
> is diagonal.
>
> That might not solve it, but it should help.
>
> Max
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Florian Nigsch
> Sent: Thursday, July 26, 2007 2:07 PM
> To: r-help@stat.math.ethz.ch
> Subject: [R] Large dataset + randomForest
>
> [Please CC me in any replies as I am not currently subscribed to the
> list. Thanks!]
>
> Dear all,
>
> I did a bit of searching on the question of large datasets but did
> not come to a definite conclusion. What I am trying to do is the
> following: I want to read in a dataset with approx. 100,000 rows and
> approx. 150 columns. The file size is ~33MB, which one would deem not
> too big a file for R. To speed up reading the file I do not use
> read.table() but a loop that reads with scan() into a buffer, does
> some preprocessing, and then adds the data to a data frame.
>
> When I then want to run randomForest(), R complains that it cannot
> allocate a vector of size 313.0 Mb. I am aware that randomForest
> needs all data in memory, but
> 1) why should that suddenly be 10 times the size of the data (I
> acknowledge the need for some internal data of R, but 10 times seems
> a bit too much), and
> 2) there is still physical memory free on the machine (4GB in total,
> even though R is limited to 2GB if I remember the help pages
> correctly - still, 2GB should be enough!). It doesn't seem to work
> either with settings changed via mem.limits() or with the run-time
> arguments --min-vsize and --max-vsize - what do these have to be set
> to in my case?
>
>> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
> Error: cannot allocate vector of size 313.0 Mb
>> object.size(df)/1024/1024
> [1] 129.5390
>
>
> Any help would be greatly appreciated,
>
> Florian
>
> --
> Florian Nigsch <[EMAIL PROTECTED]>
> Unilever Centre for Molecular Sciences Informatics
> Department of Chemistry
> University of Cambridge
> http://www-mitchell.ch.cam.ac.uk/
> Telephone: +44 (0)1223 763 073
>
>
>
>



Re: [R] Large dataset + randomForest

2007-07-27 Thread Florian Nigsch
I compiled the newest R version on Red Hat Linux (uname -a =
Linux .cam.ac.uk 2.4.21-50.ELsmp #1 SMP Tue May 8 17:18:29 EDT 2007
i686 i686 i386 GNU/Linux) with 4GB of physical memory. The step at which
the whole script crashes is within the randomForest() routine; I know
that because I want to time it and therefore have it inside a
system.time() call. This function exits with the error I posted earlier:

 > rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb

When I call gc() directly before and after calling randomForest(), I
get this:

 > gc()
             used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells     255416   6.9     899071  24.1    16800.0    818163   21.9
Vcells   17874469 136.4   90854072 693.2     4000.1 269266598 2054.4
 > rf <- randomForest(V1 ~ ., data=df, subset=trainindices, do.trace=5)
Error: cannot allocate vector of size 626.1 Mb
 > gc()
             used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells     255441   6.9     899071  24.1    16800.0    818163   21.9
Vcells   17874541 136.4  112037674 854.8     4000.1 269266598 2054.4
 >

So the only real difference is in the "gc trigger" column and the
"(Mb)" column next to it. By the way, I am not running R in GUI mode.
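
For reference, a minimal sketch of how one might bracket the call to see its
peak memory use, and of how the limits can be raised at start-up (the sizes
below are illustrative assumptions, not recommendations):

## Reset the "max used" counters, run the model, then inspect the peak.
gc(reset = TRUE)
rf <- randomForest(x = df[trainindices, -1], y = df[trainindices, 1],
                   do.trace = 5, ntree = 500)
gc()   # the "max used" column now reports the peak since the reset

## The heap limits can also be passed on the command line at start-up, e.g.
##   R --min-vsize=64M --max-vsize=3000M
## (how much of that is actually usable depends on the OS and on whether
## the build is 32- or 64-bit).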


On 27 Jul 2007, at 13:17, jim holtman wrote:

> At the max, you had 2GB of memory being used.  What operating system
> are you running on and how much physical memory do you have on your
> system?  For windows, there are parameters on the command line to
> start RGUI that let you define how much memory can be used.  I am not
> sure of Linux/UNIX.  So you are probably hitting the 2GB max and then
> you don't have any more physical memory available.  If the computation
> is a long script, you might put some 'gc()' statements in the code to
> see what section is using the most memory.
>
> Your problem might have to be broken into parts to run.
>
> On 7/27/07, Florian Nigsch <[EMAIL PROTECTED]> wrote:
>> Hi Jim,
>>
>> Here is the output of gc() of the same session of R (that I have
>> still running...)
>>
>>> gc()
>>              used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
>> Ncells     255416   6.9     899071  24.1    16800.0    818163   21.9
>> Vcells   17874469 136.4  113567591 866.5     4000.1 269266598 2054.4
>>
>> By increasing the limit of vcells and ncells to 1GB (if that is
>> possible?!), would that perhaps solve my problem?
>>
>> Cheers,
>>
>> Florian



Re: [R] Large dataset + randomForest

2007-07-26 Thread Kuhn, Max
Florian,

The first thing that you should change is how you call randomForest.
Instead of specifying the model via a formula, use the randomForest(x,
y) interface.

When a formula is used, there is a terms object created so that a model
matrix can be created for these and future observations. That terms
object can get big (I think it would be a matrix of size 151 x 150) and
is diagonal. 

That might not solve it, but it should help.
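
A minimal sketch of the two calling styles, assuming (as elsewhere in this
thread) a data frame df whose first column V1 is the response and an index
vector trainindices:

library(randomForest)

## Formula interface: a terms object and a model matrix are built first,
## which adds extra copies of the data on top of what the forest itself needs.
rf1 <- randomForest(V1 ~ ., data = df[trainindices, ], do.trace = 5)

## Matrix/vector interface: predictors and response are passed directly.
rf2 <- randomForest(x = df[trainindices, -1],
                    y = df[trainindices, 1],
                    do.trace = 5, ntree = 500)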

Max

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Florian Nigsch
Sent: Thursday, July 26, 2007 2:07 PM
To: r-help@stat.math.ethz.ch
Subject: [R] Large dataset + randomForest

[Please CC me in any replies as I am not currently subscribed to the  
list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did
not come to a definite conclusion. What I am trying to do is the
following: I want to read in a dataset with approx. 100,000 rows and
approx. 150 columns. The file size is ~33MB, which one would deem not
too big a file for R. To speed up reading the file I do not use
read.table() but a loop that reads with scan() into a buffer, does
some preprocessing, and then adds the data to a data frame.

When I then want to run randomForest(), R complains that it cannot
allocate a vector of size 313.0 Mb. I am aware that randomForest
needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I
acknowledge the need for some internal data of R, but 10 times seems
a bit too much), and
2) there is still physical memory free on the machine (4GB in total,
even though R is limited to 2GB if I remember the help pages
correctly - still, 2GB should be enough!). It doesn't seem to work
either with settings changed via mem.limits() or with the run-time
arguments --min-vsize and --max-vsize - what do these have to be set
to in my case?

 > rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
 > object.size(df)/1024/1024
[1] 129.5390


Any help would be greatly appreciated,

Florian

--
Florian Nigsch <[EMAIL PROTECTED]>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073






[R] Large dataset + randomForest

2007-07-26 Thread Florian Nigsch
[Please CC me in any replies as I am not currently subscribed to the  
list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did
not come to a definite conclusion. What I am trying to do is the
following: I want to read in a dataset with approx. 100,000 rows and
approx. 150 columns. The file size is ~33MB, which one would deem not
too big a file for R. To speed up reading the file I do not use
read.table() but a loop that reads with scan() into a buffer, does
some preprocessing, and then adds the data to a data frame.

When I then want to run randomForest(), R complains that it cannot
allocate a vector of size 313.0 Mb. I am aware that randomForest
needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I
acknowledge the need for some internal data of R, but 10 times seems
a bit too much), and
2) there is still physical memory free on the machine (4GB in total,
even though R is limited to 2GB if I remember the help pages
correctly - still, 2GB should be enough!). It doesn't seem to work
either with settings changed via mem.limits() or with the run-time
arguments --min-vsize and --max-vsize - what do these have to be set
to in my case?

 > rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
 > object.size(df)/1024/1024
[1] 129.5390


Any help would be greatly appreciated,

Florian

--
Florian Nigsch <[EMAIL PROTECTED]>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073






Re: [R] large dataset!

2006-07-03 Thread rdporto1
Jennifer,

we had a little discussion about this topic
last May when I had a similar problem.
It is archived at

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/76401.html

You can follow the thread to see the various
arguments and solutions. I tried to summarize
the suggested approaches at

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/76583.html

HTH,

Rogerio Porto.

-- Original header ---

From: [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Cc: 
Date: Sun, 2 Jul 2006 10:12:25 -0400 (EDT)
Subject: [R] large dataset!

> 
> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables, and I was planning on using R to do this, but
> I can't even seem to read in the data. It just freezes and ties
> up the whole system -- and this is on a Linux box purchased about
> 6 months ago, a dual-processor PC that was pretty much the top
> of the line. I've tried expanding R's memory limits, but it
> doesn't help. I'll be hugely disappointed if I can't use R b/c
> I need to build tailor-made models (multilevel and other
> complexities). My fall-back is the S-PLUS big data package, but
> I'd rather avoid it if anyone can provide a solution.
> 
> Thanks
> 
> Jennifer Hill
> 


Re: [R] large dataset!

2006-07-03 Thread Anupam Tyagi
JENNIFER HILL  columbia.edu> writes:

> 
> 
> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables, and I was planning on using R to do this, but
> I can't even seem to read in the data. It just freezes and ties
> up the whole system -- and this is on a Linux box purchased about
> 6 months ago, a dual-processor PC that was pretty much the top
> of the line. I've tried expanding R's memory limits, but it
> doesn't help. I'll be hugely disappointed if I can't use R b/c
> I need to build tailor-made models (multilevel and other
> complexities). My fall-back is the S-PLUS big data package, but
> I'd rather avoid it if anyone can provide a solution.
> 
> Thanks
> 
> Jennifer Hill
> 
Dear Jennifer, you may want to look at the R newsletters. A few years ago
there was an article on using a DBMS with R, such as MySQL, Oracle, etc. This
is a frequently asked question; there are also some posts from the past few
years that may be helpful. I have successfully read a large database into
MySQL and accessed it from R---it was larger than your dataset. I hope that
helps.

Anupam Tyagi
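
A minimal sketch of that route with RMySQL, where the database name,
credentials and table/column names are all hypothetical:

library(RMySQL)   # DBI-based interface to MySQL

con <- dbConnect(MySQL(), dbname = "survey", user = "jennifer",
                 password = "secret", host = "localhost")

## Let the database do the filtering/aggregation and pull back only the
## variables (or summaries) actually needed for the model.
dat <- dbGetQuery(con, "SELECT id, y, x1, x2 FROM obs WHERE year >= 2000")

dbDisconnect(con)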



Re: [R] large dataset!

2006-07-02 Thread miguel manese
Hello Jennifer,

I'm writing a package, SQLiteDF, for Google SoC 2006, under the
supervision of Prof. Bates & Prof. Ripley. Basically, it stores data
frames in SQLite databases (i.e. in a file) and aims to make them
transparently accessible from R using the same operators as ordinary
data frames.

Right now it's quite usable (the "indexers" are working, as are some
other generic methods), but only on Linux (I should have the Windows
package any time soon, though). I would love to hear about your
requirements so that I can test my package.

Cheers,
M. Manese
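
The SQLiteDF interface itself is not shown here; as a rough illustration of
the same underlying idea (a data frame kept in an SQLite file and queried in
pieces) using plain RSQLite, with the file and table names assumed and df
standing for the data frame in question:

library(RSQLite)

con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

## Write the data frame to the file once (possibly block by block on import),
## then query only the rows/columns needed instead of holding it all in RAM.
dbWriteTable(con, "obs", df, overwrite = TRUE)
sub <- dbGetQuery(con, "SELECT id, y FROM obs LIMIT 100000")

dbDisconnect(con)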

On 7/3/06, Andrew Robinson <[EMAIL PROTECTED]> wrote:
> Jennifer,
>




Re: [R] large dataset!

2006-07-02 Thread Andrew Robinson
Jennifer,

it sounds like that's too much data for R to hold in your computer's
RAM. You should give serious consideration as to whether you need all
those data for the models that you're fitting, and if so, whether you
need to do them all at once. If not, think about pre-processing
steps, using e.g. SQL commands, to pull out the data that you need. For
example, if the data are spatial, then think about analyzing them by
patches.  

Good luck,

Andrew

On Sun, Jul 02, 2006 at 10:12:25AM -0400, JENNIFER HILL wrote:
> 
> Hi, I need to analyze data that has 3.5 million observations and
> about 60 variables and I was planning on using R to do this but
> I can't even seem to read in the data.  It just freezes and ties
> up the whole system -- and this is on a Linux box purchased about
> 6 months ago on a dual-processor PC that was pretty much the top
> of the line.  I've tried expanding R the memory limits but it 
> doesn't help.  I'll be hugely disappointed if I can't use R b/c
> I need to do build tailor-made models (multilevel and other 
> complexities).   My fall-back is the SPlus big data package but
> I'd rather avoid if anyone can provide a solution
> 
> Thanks
> 
> Jennifer Hill
> 

-- 
Andrew Robinson  
Department of Mathematics and StatisticsTel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia Fax: +61-3-8344-4599
Email: [EMAIL PROTECTED] http://www.ms.unimelb.edu.au



[R] large dataset!

2006-07-02 Thread JENNIFER HILL

Hi, I need to analyze data that has 3.5 million observations and
about 60 variables, and I was planning on using R to do this, but
I can't even seem to read in the data. It just freezes and ties
up the whole system -- and this is on a Linux box purchased about
6 months ago, a dual-processor PC that was pretty much the top
of the line. I've tried expanding R's memory limits, but it
doesn't help. I'll be hugely disappointed if I can't use R b/c
I need to build tailor-made models (multilevel and other
complexities). My fall-back is the S-PLUS big data package, but
I'd rather avoid it if anyone can provide a solution.

Thanks

Jennifer Hill



Re: [R] large dataset import, aggregation and reshape

2005-04-24 Thread Renaud Lancelot
Christoph Lehmann wrote:
Dear useRs
We have a data set (comma delimited) with 12 million rows and 5
columns (in fact many more, but we need only 4 of them): id, factor 'a'
(5 levels), factor 'b' (15 levels), date-stamp, numeric measurement. We
run R on SuSE Linux 9.1 with 2GB RAM (and a 3.5GB swap file).

On average we have 30 obs. per id. We want to aggregate (e.g. the sum of
the measurements under each factor level of 'a', and the same for factor
'b') and reshape the data so that for each id we have only one row in
the final data.frame, which means we finally have roughly 400,000 rows.

I tried read.delim, used the nrows argument, and defined colClasses (with
an as.Date class) - memory problems at the latest when calling reshape
and aggregate. Also importing the date column as character and then
converting it using 'as.Date' didn't succeed.

It seems the problematic, memory-intensive parts are:
a) importing the huge data per se (but data with dim c(12e6, 5) should be << 2GB?)
b) converting the time-stamp to a 'Date' class
c) the aggregate and reshape task
What are the steps you would recommend?
(i) using scan instead of read.delim (with or without colClasses?)
(ii) importing blocks of data (e.g. 1 million lines at a time), aggregating
them, importing the next block, and so on?
(iii) putting the data into a MySQL database, importing from there and
doing the reshape and aggregation in R for both factors separately

thanks for hints from your valuable experience
cheers
christoph

I would try the latter and use an SQL interface such as RODBC or
RMySQL. You can send your aggregation and reshape commands to the
external database as an SQL query.

Example with a database I have at hand. The table "datemesu" has 640,000 
rows and 5 columns, the field "mesure" being a factor with 2 levels, "N" 
and "P".

> library(RODBC)
> fil <- "C:/Archives/Baobab/Baobab2000.mdb"
> chann <- odbcConnectAccess(fil)
> quer <- paste("SELECT numani, SUM(IIF(mesure = 'P', 1, 0)) AS wt,",
+  "SUM(IIF(mesure = 'N', 1, 0)) AS bcs,",
+  "MIN(date) AS minDate",
+   "FROM datemesu",
+   "GROUP BY numani")
> system.time(tab <- sqlQuery(chann, quer), gcFirst = TRUE)
[1] 11.16  0.19 11.54    NA    NA
> odbcCloseAll()
>
> dim(tab)
[1] 69360 4
> head(tab)
   numani wt bcs    minDate
1 SNFLCA1  1   0 1987-01-23
2 SNFLCA2  2   0 1987-01-10
3 SNFLCA4  1   0 1987-01-10
4 SNFLCA6  4   0 1987-02-02
5 SNFLCA7  4   0 1987-02-18
6 SNFLCA8  3   0 1987-01-09
Best,
Renaud
--
Dr Renaud Lancelot, veterinarian
C/O Ambassade de France - SCAC
BP 834 Antananarivo 101 - Madagascar
e-mail: [EMAIL PROTECTED]
tel.:   +261 32 40 165 53 (cell)
+261 20 22 665 36 ext. 225 (work)
+261 20 22 494 37 (home)


[R] large dataset import, aggregation and reshape

2005-04-24 Thread Christoph Lehmann
Dear useRs
We have a data set (comma delimited) with 12 million rows and 5
columns (in fact many more, but we need only 4 of them): id, factor 'a'
(5 levels), factor 'b' (15 levels), date-stamp, numeric measurement. We
run R on SuSE Linux 9.1 with 2GB RAM (and a 3.5GB swap file).

On average we have 30 obs. per id. We want to aggregate (e.g. the sum of
the measurements under each factor level of 'a', and the same for factor
'b') and reshape the data so that for each id we have only one row in
the final data.frame, which means we finally have roughly 400,000 rows.

I tried read.delim, used the nrows argument, and defined colClasses (with
an as.Date class) - memory problems at the latest when calling reshape
and aggregate. Also importing the date column as character and then
converting it using 'as.Date' didn't succeed.

It seems the problematic, memory-intensive parts are:
a) importing the huge data per se (but data with dim c(12e6, 5) should be << 2GB?)
b) converting the time-stamp to a 'Date' class
c) the aggregate and reshape task
What are the steps you would recommend?
(i) using scan instead of read.delim (with or without colClasses?)
(ii) importing blocks of data (e.g. 1 million lines at a time), aggregating
them, importing the next block, and so on? (a rough sketch of this follows below)
(iii) putting the data into a MySQL database, importing from there and
doing the reshape and aggregation in R for both factors separately

thanks for hints from your valuable experience
cheers
christoph
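
For option (ii), a rough sketch of block-wise import and aggregation; the
file name, separator, column names and classes are assumptions about the
actual data (and the file is assumed to have no header row):

cols  <- c("id", "a", "b", "date", "y")
ccls  <- c("character", "factor", "factor", "character", "numeric")
block <- 1e6
skip  <- 0
parts <- list()

repeat {
  chunk <- tryCatch(
    read.delim("data.csv", sep = ",", header = FALSE, col.names = cols,
               colClasses = ccls, skip = skip, nrows = block),
    error = function(e) NULL)            # end of file (or read error): stop
  if (is.null(chunk) || nrow(chunk) == 0) break

  ## aggregate within the block; the partial sums are combined below
  parts[[length(parts) + 1]] <-
    aggregate(chunk$y, by = list(id = chunk$id, a = chunk$a), FUN = sum)

  skip <- skip + block
  if (nrow(chunk) < block) break         # last (short) block has been read
}

## combine the per-block results and aggregate once more over all blocks
## (aggregate() names the summed column "x" by default)
all <- do.call(rbind, parts)
res <- aggregate(all$x, by = list(id = all$id, a = all$a), FUN = sum)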