Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

2006-12-21 Thread Martin Morgan
Section 8 of the R Installation and Administration guide says that on
64-bit architectures the size of a block of memory allocated is
limited to 2^32-1 bytes (just under 4 GB).

The wording 'a block of memory' here is important, because this sets a
limit on a single allocation rather than on the total memory consumed by
an R session. The original poster's allocation was something like
300,000 SNPs x 1000 individuals x 8 bytes (depending on the
representation) = about 2.3 GB, so there is still some room
for even larger data.
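
A quick back-of-the-envelope check of that figure, in plain R (this
assumes the data end up in an ordinary numeric, i.e. double, matrix):

  n.snp <- 300000              # rows (SNPs)
  n.ind <- 1000                # columns (individuals)
  n.snp * n.ind * 8            # total bytes, at 8 bytes per double
  n.snp * n.ind * 8 / 2^30     # roughly 2.2 GiB, well within a single block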

Obviously it's important to think carefully about how the statistical
analysis of such a large volume of data will proceed and how the results
will be interpreted.

Martin

Thomas Lumley <[EMAIL PROTECTED]> writes:

> On Thu, 21 Dec 2006, Iris Kolder wrote:
>
>> Thank you all for your help!
>>
>> So with all your suggestions we will try to run it on a computer with a 
>> 64-bit processor. But I've been told that the new R versions all work 
>> on a 32-bit processor. I read in other posts that only the old R 
>> versions were capable of handling larger data sets and ran on 64-bit 
>> processors. I also read that they are adapting the new R version for 
>> 64-bit processors again, so does anyone know if there is a version 
>> available that we could use?
>
> Huh?  R 2.4.x runs perfectly happily accessing large memory under Linux on 
> 64bit processors (and Solaris, and probably others). I think it even works 
> on Mac OS X now.
>
> For example:
>> x<-rnorm(1e9)
>> gc()
>   used   (Mb) gc trigger   (Mb)   max used   (Mb)
> Ncells 222881   12.0 467875   25.0 35   18.7
> Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3
>
>
>  -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> [EMAIL PROTECTED] University of Washington, Seattle
>

-- 
Martin T. Morgan
Bioconductor / Computational Biology
http://bioconductor.org

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

2006-12-21 Thread Thomas Lumley
On Thu, 21 Dec 2006, Iris Kolder wrote:

> Thank you all for your help!
>
> So with all your suggestions we will try to run it on a computer with a 
> 64-bit processor. But I've been told that the new R versions all work 
> on a 32-bit processor. I read in other posts that only the old R 
> versions were capable of handling larger data sets and ran on 64-bit 
> processors. I also read that they are adapting the new R version for 
> 64-bit processors again, so does anyone know if there is a version 
> available that we could use?

Huh?  R 2.4.x runs perfectly happily accessing large memory under Linux on 
64bit processors (and Solaris, and probably others). I think it even works 
on Mac OS X now.

For example:
> x<-rnorm(1e9)
> gc()
  used   (Mb) gc trigger   (Mb)   max used   (Mb)
Ncells 222881   12.0 467875   25.0 35   18.7
Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3


 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

2006-12-21 Thread Iris Kolder
Thank you all for your help! 

So with all your suggestions we will try to run it on a computer with a 64-bit 
processor. But I've been told that the new R versions all work on a 32-bit 
processor. I read in other posts that only the old R versions were capable of 
handling larger data sets and ran on 64-bit processors. I also read that they 
are adapting the new R version for 64-bit processors again, so does anyone 
know if there is a version available that we could use?

Iris Kolder

----- Original Message -----
From: "Liaw, Andy" <[EMAIL PROTECTED]>
To: Martin Morgan <[EMAIL PROTECTED]>; Iris Kolder <[EMAIL PROTECTED]>
Cc: r-help@stat.math.ethz.ch; N.C. Onland-moret <[EMAIL PROTECTED]>
Sent: Monday, December 18, 2006 7:48:23 PM
Subject: RE: [R] Memory problem on a linux cluster using a large data set 
[Broadcast]


In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:

- Use larger nodesize
- Use sampsize to control the size of bootstrap samples

Both of these have the effect of reducing sizes of trees grown.  For a
data set that large, it may not matter to grow smaller trees.

Still, with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan
> 
> Iris --
> 
> I hope the following helps; I think you have too much data 
> for a 32-bit machine.
> 
> Martin
> 
> Iris Kolder <[EMAIL PROTECTED]> writes:
> 
> > Hello,
> >  
> > I have a large data set: 320,000 rows and 1000 columns. All the data 
> > has the values 0, 1, 2.
> 
> It seems like a single copy of this data set will be at least 
> a couple of gigabytes; I think you'll have access to only 4 
> GB on a 32-bit machine (see section 8 of the R Installation 
> and Administration guide), and R will probably end up, even 
> in the best of situations, making at least a couple of copies 
> of your data. Probably you'll need a 64-bit machine, or 
> figure out algorithms that work on chunks of data.
> 
> > on a linux cluster with R version R 2.1.0.  which operates on a 32
> 
> This is quite old, and in general it seems like R has become 
> more sensitive to big-data issues and tracking down 
> unnecessary memory copying.
> 
> > "cannot allocate vector size 1240 kb". I've searched through
> 
> use traceback() or options(error=recover) to figure out where 
> this is actually occurring.
> 
> > SNP <- read.table("file.txt", header=FALSE, sep="")   # read in data file
> 
> This makes a data.frame, and data frames have several aspects 
> (e.g., automatic creation of row names on sub-setting) that 
> can be problematic in terms of memory use. Probably better to 
> use a matrix, for which:
> 
>  'read.table' is not the right tool for reading large matrices,
>  especially those with many columns: it is designed to read _data
>  frames_ which may have columns of very different classes. Use
>  'scan' instead.
> 
> (from the help page for read.table). I'm not sure of the 
> details of the algorithms you'll invoke, but it might be a 
> false economy to try to get scan to read in 'small' versions 
> (e.g., integer, rather than
> numeric) of the data -- the algorithms might insist on 
> numeric data, and then make a copy during coercion from your 
> small version to numeric.
> 
> > SNP$total.NAs = rowSums(is.na(SNP))   # count NAs per row and add a column with the totals
> 
> This adds a column to the data.frame or matrix, probably 
> causing at least one copy of the entire data. Create a 
> separate vector instead, even though this unties the 
> coordination between columns that a data frame provides.
> 
> > SNP = t(as.matrix(SNP))   # transpose rows and columns
> 
> This will also probably trigger a copy.
> 
> > snp.na<-SNP
> 
> R might be clever enough to figure out that this simple 
> assignment does not trigger a copy. But it probably means 
> that any subsequent modification of snp.na or SNP *will* 
> trigger a copy, so avoid the assignment if possible.
> 
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])   # assigns factor to case/control status
> >  
> > snp.narf<- randomForest(snp.roughfix[,-1], fSNP, 
> > na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE, 
> > keep.forest=FALSE, do.trace=100)
> 
> Now you're entirely in the hands of the randomForest. If 
> memory problems occur here, perhaps you'll have gained enough 
> experience to point the package maintainer to the problem and 
> suggest a possible solution.

Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

2006-12-18 Thread Liaw, Andy
In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:

- Use larger nodesize
- Use sampsize to control the size of bootstrap samples

Both of these have the effect of reducing sizes of trees grown.  For a
data set that large, it may not matter to grow smaller trees.
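
As a rough, untested sketch based on the call in the original post (the
nodesize and sampsize values below are just placeholders to show where
the arguments go, not tuned recommendations):

  library(randomForest)
  snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
                           ntree = 500, mtry = 10,
                           nodesize = 50,     # larger terminal nodes => shallower trees
                           sampsize = 10000,  # smaller bootstrap sample per tree
                           importance = TRUE, keep.forest = FALSE,
                           do.trace = 100)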

Still, with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan
> 
> Iris --
> 
> I hope the following helps; I think you have too much data 
> for a 32-bit machine.
> 
> Martin
> 
> Iris Kolder <[EMAIL PROTECTED]> writes:
> 
> > Hello,
> >  
> > I have a large data set: 320,000 rows and 1000 columns. All the data 
> > has the values 0, 1, 2.
> 
> It seems like a single copy of this data set will be at least 
> a couple of gigabytes; I think you'll have access to only 4 
> GB on a 32-bit machine (see section 8 of the R Installation 
> and Administration guide), and R will probably end up, even 
> in the best of situations, making at least a couple of copies 
> of your data. Probably you'll need a 64-bit machine, or 
> figure out algorithms that work on chunks of data.
> 
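A minimal sketch of the chunked approach, assuming whitespace-separated
values in "file.txt" with 1000 columns (the chunk size here is arbitrary):

  con <- file("file.txt", open = "r")
  repeat {
      chunk <- scan(con, what = double(), nlines = 10000, quiet = TRUE)
      if (length(chunk) == 0) break            # no more lines to read
      m <- matrix(chunk, ncol = 1000, byrow = TRUE)
      ## ... compute and accumulate per-row summaries from 'm' here ...
  }
  close(con)
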
> > on a linux cluster with R version R 2.1.0.  which operates on a 32
> 
> This is quite old, and in general it seems like R has become 
> more sensitive to big-data issues and tracking down 
> unnecessary memory copying.
> 
> > "cannot allocate vector size 1240 kb". I've searched through
> 
> use traceback() or options(error=recover) to figure out where 
> this is actually occurring.
> 
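For example, in an interactive session one might do something like:

  options(error = recover)   # on error, offers to browse the call frames
  ## ... re-run the code that triggers the allocation error ...
  traceback()                # afterwards, prints the call stack of the last error
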
> > SNP <- read.table("file.txt", header=FALSE, sep="")   # read in data file
> 
> This makes a data.frame, and data frames have several aspects 
> (e.g., automatic creation of row names on sub-setting) that 
> can be problematic in terms of memory use. Probably better to 
> use a matrix, for which:
> 
>  'read.table' is not the right tool for reading large matrices,
>  especially those with many columns: it is designed to read _data
>  frames_ which may have columns of very different classes. Use
>  'scan' instead.
> 
> (from the help page for read.table). I'm not sure of the 
> details of the algorithms you'll invoke, but it might be a 
> false economy to try to get scan to read in 'small' versions 
> (e.g., integer, rather than
> numeric) of the data -- the algorithms might insist on 
> numeric data, and then make a copy during coercion from your 
> small version to numeric.
> 
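A rough sketch of that, assuming whitespace-separated values in
"file.txt" and taking the column count from the first line:

  nc  <- length(scan("file.txt", what = double(), nlines = 1, quiet = TRUE))
  SNP <- matrix(scan("file.txt", what = double(), quiet = TRUE),
                ncol = nc, byrow = TRUE)
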
> > SNP$total.NAs = rowSums(is.na(SNP))   # count NAs per row and add a column with the totals
> 
> This adds a column to the data.frame or matrix, probably 
> causing at least one copy of the entire data. Create a 
> separate vector instead, even though this unties the 
> coordination between columns that a data frame provides.
> 
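A minimal sketch of that suggestion, reusing the threshold of 46 missing
values mentioned further down in the original post:

  total.NAs <- rowSums(is.na(SNP))   # one count per row, kept outside the data
  SNP <- SNP[total.NAs <= 46, ]      # drop rows with more than 46 missing values
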
> > SNP = t(as.matrix(SNP))   # transpose rows and columns
> 
> This will also probably trigger a copy.
> 
> > snp.na<-SNP
> 
> R might be clever enough to figure out that this simple 
> assignment does not trigger a copy. But it probably means 
> that any subsequent modification of snp.na or SNP *will* 
> trigger a copy, so avoid the assignment if possible.
> 
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])   # assigns factor to case/control status
> >  
> > snp.narf<- randomForest(snp.roughfix[,-1], fSNP, 
> > na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE, 
> > keep.forest=FALSE, do.trace=100)
> 
> Now you're entirely in the hands of the randomForest. If 
> memory problems occur here, perhaps you'll have gained enough 
> experience to point the package maintainer to the problem and 
> suggest a possible solution.
> 
> > set it should be able to cope with that amount. Perhaps someone has 
> > tried this before in R or is Fortran a better choice? I added my R
> 
> If you mean a pure Fortran solution, including coding the 
> random forest algorithm, then of course you have complete 
> control over memory management. You'd still likely be limited 
> to addressing 4 GB of memory. 
> 
> 
> > I wrote a script to remove all the rows with more than 46 missing 
> > values. This works perfectly on a smaller dataset. But the problem 
> > arises when I try to run it on the larger data set: I get an error 
> > "cannot allocate vector size 1240 kb". I've searched through previous 
> > posts and found out that it might be because I'm running it on a linux 
> > cluster with R version 2.1.0, which operates on a 32-bit processor. 
> > But I could not find a solution for this problem. The cluster is a 
> > really fast one and should be able to cope with these large amounts of 
> > data; the system's configuration is: speed 3.4 GHz, memory 4 GB. Is 
> > there a way to change the settings or processor under R? I want to run 
> > the function randomForest on my large data set