Re: [R] How to deal with more than 6GB dataset using R?

2010-08-01 Thread Jing Li
I tried several ways:

1. I used the scan() function. It read the 6 GB file into memory without
difficulty; it just took some time. But reading the data in was not enough:
the next step was to plot() the data and then fit a nonlinear regression
model, and it got stuck at the plot() step because the memory limit had
already been reached, even though I am on a 64-bit system with a large
amount of RAM.

2. I tried the bigmemory package. It can read the dataset into memory as
well, but it stores the data in a matrix (big.matrix) format, and the usual
functions such as nls(), plot() and so on do not work on that format -- that
is the problem. What should I do then?
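
(A minimal sketch of two possible workarounds, assuming the data are already
attached as a big.matrix X and that only a couple of columns enter the model;
the column numbers, formula and starting values below are invented:)

library(bigmemory)   # assumes X is an existing big.matrix holding the data

# Plotting: a random sample of rows is usually indistinguishable from the full cloud
idx <- sort(sample(nrow(X), 1e5))
plot(X[idx, 1], X[idx, 2], pch = ".")

# Modelling: pull only the columns the model needs into an ordinary data frame
d   <- data.frame(x = X[, 1], y = X[, 2])
fit <- nls(y ~ a * exp(b * x), data = d, start = list(a = 1, b = 0.1))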

Or do I need to switch to SAS? I am sure many people here deal with large
datasets; what do you do in this situation?

Thanks.




-- 
Best,
Jing Li



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-28 Thread Jens Oehlschlägel
Matthew,

You might want to look at the function read.table.ffdf in the ff package, which can 
read large csv files in chunks and store the result in a binary format on disk 
that can be accessed quickly from R. ff lets you access complete columns 
(returned as a vector or array) or subsets of the data identified by 
row positions (plus a column selection, returned as a data.frame). As Jim pointed 
out, it all depends on what you are doing with the data: if you want to access 
subsets not by row position but by search conditions, you are better off 
with an indexed database.
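
A minimal sketch of that import, assuming a csv file bigfile.csv with one
integer and five numeric columns (the file name, column classes and chunk
sizes are invented):

library(ff)

big <- read.csv.ffdf(file = "bigfile.csv", header = TRUE,
                     colClasses = c("integer", rep("numeric", 5)),
                     first.rows = 10000,     # rows read in the first chunk
                     next.rows  = 500000)    # rows read per subsequent chunk

dim(big)             # an ffdf: the data live on disk, not in RAM
sub <- big[1:10, ]   # row-position subset, returned as an ordinary data.frame
col <- big[[1]][]    # one complete column pulled into RAM as a plain vector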

Please let me know if you write a fast read.fwf.ffdf - we would be happy to 
include it in the ff package.


Jens



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-27 Thread Matthew Keller
I've found that opening a connection and scanning (in a loop)
line by line is far faster than either read.table or read.fwf.  E.g.,
here's a file (temp2) that has 1500 rows and 550K columns:

showConnections(all=TRUE)
con <- file(temp2, open='r')
system.time({
for (i in 0:(num.samp-1)){
  new.gen[i+1,] <- scan(con, what='integer', nlines=1)}
})
close(con)
#THIS TAKES 4.6 MINUTES


system.time({
new.gen2 <- read.fwf(con, widths=rep(1,num.cols), buffersize=100,
                     header=FALSE, colClasses=rep('integer',num.cols))
})
#THIS TAKES OVER 20 MINUTES (I GOT BORED OF WAITING AND KILLED IT)


This seems surprising to me. Can anyone see some other way to speed
this type of thing up?

Matt


On Sat, Jul 24, 2010 at 1:55 PM, Greg Snow greg.s...@imail.org wrote:
 You may want to look at the biglm package as another way to fit regression
 models on very large data sets.


-- 
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-27 Thread jim holtman
It all depends on what you are doing with the data.  First, in your
scan example I would not read in one line at a time, but probably
several thousand, and then process them.  Most of your time is
probably spent in reading.  I assume that you are not reading it all
in at once (but then maybe you are, since you have a 64-bit version).
It is also good to understand what read.fwf is doing: it reads in
the file, parses it by columns, writes it with a separator to a
temporary file and then reads that file back in with read.table to get
the final result -- that is one of the reasons it takes so long.

You might also consider putting the data into a database and then
reading the required records out of there.  But it is hard to give
specific advice, since we don't know what you want to do with the data.
In any case, read a good-sized portion (several MB at a time) to get
the economy of scale rather than a line at a time.

Here is an example of reading a csv file with 666,000 lines at 1 line per
'scan' call, then 10 lines, 1,000 lines and 10,000 lines per call.  Notice
that at nlines=1 it takes about 30 CPU seconds to process the data; at
nlines=1000 it takes 2.8 (roughly 10X faster).  So time the various options
and see what happens.

> input <- file(file, 'r')
> n <- 1  # lines to read per call
> system.time({
+ repeat{
+ lines <- scan(input, what=list('',''), sep=',', nlines=n, quiet=TRUE)
+ if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
  29.52    0.08   29.90
> close(input)
> input <- file(file, 'r')
> n <- 10  # lines to read per call
> system.time({
+ repeat{
+ lines <- scan(input, what=list('',''), sep=',', nlines=n, quiet=TRUE)
+ if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   5.93    0.00    5.99
> close(input)
> input <- file(file, 'r')
> n <- 1000  # lines to read per call
> system.time({
+ repeat{
+ lines <- scan(input, what=list('',''), sep=',', nlines=n, quiet=TRUE)
+ if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   2.79    0.08    2.90
> close(input)
> input <- file(file, 'r')
> n <- 10000  # lines to read per call
> system.time({
+ repeat{
+ lines <- scan(input, what=list('',''), sep=',', nlines=n, quiet=TRUE)
+ if (length(lines[[1]]) == 0) break
+
+ }
+ })
   user  system elapsed
   2.76    0.00    2.76
> close(input)
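
Applied to the matrix example above, the same idea looks roughly like this
(a sketch only: it assumes whitespace-separated integer values, the
num.samp/num.cols values and a preallocated new.gen matrix from the earlier
post, and an arbitrary chunk size):

chunk <- 100                                  # rows per scan() call -- tune to memory
con   <- file("temp2", open = "r")
row   <- 0
while (row < num.samp) {
  vals <- scan(con, what = integer(), nlines = chunk, quiet = TRUE)
  if (length(vals) == 0) break                # end of file
  nr <- length(vals) / num.cols               # rows actually read in this block
  new.gen[row + seq_len(nr), ] <- matrix(vals, nrow = nr, byrow = TRUE)
  row <- row + nr
}
close(con)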







Re: [R] How to deal with more than 6GB dataset using R?

2010-07-24 Thread Greg Snow
You may want to look at the biglm package as another way to fit regression models 
on very large data sets.
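
A minimal sketch of the usual chunked biglm pattern, assuming a csv file
bigfile.csv with columns y, x1 and x2 (the file name, column names, formula
and chunk size are all invented):

library(biglm)

con   <- file("bigfile.csv", open = "r")
chunk <- read.csv(con, header = TRUE, nrows = 100000)
fit   <- biglm(y ~ x1 + x2, data = chunk)          # fit on the first chunk

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 100000,
             col.names = c("y", "x1", "x2")),
    error = function(e) NULL)                      # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                        # fold the next chunk into the fit
}
close(con)
summary(fit)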

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111




[R] How to deal with more than 6GB dataset using R?

2010-07-23 Thread babyfoxlove1
Hi there,

Sorry to bother those who are not interested in this problem.

I'm dealing with a large data set (a file of more than 6 GB) and running regression 
tests on those data. I was wondering whether there are more efficient ways to read 
the data than just using read.table(). BTW, I'm using a 64-bit desktop and a 64-bit 
build of R, and the machine has enough memory for my purposes.
Thanks.


--Gin



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-23 Thread Duncan Murdoch

On 23/07/2010 12:10 PM, babyfoxlo...@sina.com wrote:

[...]


You probably won't get much faster than read.table with all of the 
colClasses specified.  It will be a lot slower if you leave that at the 
default NA setting, because then R needs to figure out the types by 
reading them as character and examining all the values.  If the file is 
very consistently structured (e.g. the same number of characters in 
every value in every row) you might be able to write a C function to 
read it faster, but I'd guess the time spent writing that would be a lot 
more than the time saved.
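
In practice that looks something like the call below (a sketch only; the file
name, column types and row count are invented):

dat <- read.table("bigfile.txt", header = TRUE,
                  colClasses = c("integer", "numeric", "numeric", "factor"),
                  nrows = 21000000,      # a mild over-estimate of the row count is fine
                  comment.char = "")     # skipping comment scanning also helps a bit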


Duncan Murdoch



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-23 Thread Allan Engelhardt
read.table is not very inefficient IF you specify the colClasses= 
parameter.  scan (with the what= parameter) is probably a little more 
efficient.  In either case, save the data using save() once you have it 
in the right structure and it will be much more efficient to read it 
next time.  (In fact I often exit R at this stage and re-start it with 
the .RData file before I start the analysis to clear out the memory.)


I did a lot of testing on the types of (large) data structures I 
normally work with and found that options(save.defaults = 
list(compress="bzip2", compression_level=6, ascii=FALSE)) gave me the 
best trade-off between size and speed.  Your mileage will undoubtedly 
vary, but if you do this a lot it may be worth getting hard data for 
your setup.
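
A sketch of that read-once / save() / load() workflow, with an invented file
name and column classes:

options(save.defaults = list(compress = "bzip2", compression_level = 6, ascii = FALSE))

big <- read.table("bigfile.txt", header = TRUE,
                  colClasses = c("integer", rep("numeric", 9)))
save(big, file = "bigfile.RData")    # one-off conversion to R's binary format

# in later sessions (typically much faster than re-parsing the text file):
load("bigfile.RData")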


Hope this helps a little.

Allan



Re: [R] How to deal with more than 6GB dataset using R?

2010-07-23 Thread Allan Engelhardt

On 23/07/10 17:36, Duncan Murdoch wrote:

On 23/07/2010 12:10 PM, babyfoxlo...@sina.com wrote:

[...]


You probably won't get much faster than read.table with all of the 
colClasses specified.  It will be a lot slower if you leave that at 
the default NA setting, because then R needs to figure out the types 
by reading them as character and examining all the values.  If the 
file is very consistently structured (e.g. the same number of 
characters in every value in every row) you might be able to write a C 
function to read it faster, but I'd guess the time spent writing that 
would be a lot more than the time saved.


And try the utils::read.fwf() function before you roll your own C code 
for this use case.
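
For example (a sketch only; the file name and field widths are invented, and
extra arguments such as colClasses are passed through to read.table):

dat <- read.fwf("bigfile.fwf",
                widths = c(8, 2, 10, 10, 4),    # one width per fixed-width field
                colClasses = c("integer", "factor", "numeric", "numeric", "integer"),
                header = FALSE,
                buffersize = 2000)              # lines parsed per internal chunk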


If you do write C code, consider writing a converter to .RData format 
which R seems to read quite efficiently.


Hope this helps.

Allan
