Re: [R] How to deal with more than 6GB dataset using R?
I tried several ways:

1. I used the scan() function. It could read the 6 GB file into memory without difficulty; it just took some time. But just reading it into memory was definitely not enough: the next step, which was to plot() and then try to build the nonlinear regression model, got stuck at the plot() part, since the memory limit had already been reached, even though I have a 64-bit system and a huge amount of memory.

2. I tried the bigmemory package. It can read the dataset into memory as well, but since it stores the data in a matrix format, the normal functions such as nls(), plot() and so on cannot work on it; that is the problem.

What should I do then? Or do I need to change to SAS? I believe there are a lot of people who are dealing with large datasets. What did you do in this situation? Thanks.
--
Best,
Jing Li

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to deal with more than 6GB dataset using R?
Matthew,

You might want to look at the function read.table.ffdf in the ff package, which can read large csv files in chunks and store the result in a binary format on disk that can be quickly accessed from R. ff allows you to access complete columns (returned as a vector or array) or subsets of the data identified by row positions (with column selection, returned as a data.frame).

As Jim pointed out, it all depends on what you are doing with the data. If you want to access subsets not by row position but rather by search conditions, you are better off with an indexed database.

Please let me know if you write a fast read.fwf.ffdf; we would be happy to include it in the ff package.

Jens
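A minimal sketch of the ff approach Jens describes; the file name, separator and column types here are assumptions, so adjust them to your data:

```r
library(ff)

# Read the csv in chunks; the resulting ffdf lives on disk, not in RAM.
big <- read.table.ffdf(file = "big.csv", header = TRUE, sep = ",",
                       colClasses = c("integer", "numeric", "numeric"))

first1000 <- big[1:1000, ]  # row subset, returned as an ordinary data.frame
col2 <- big[[2]][]          # one whole column, materialized as a plain vector
```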
Re: [R] How to deal with more than 6GB dataset using R?
I've found that opening a connection and scanning (in a loop) line by line is far faster than either read.table or read.fwf. E.g., here's a file (temp2) that has 1500 rows and 550K columns:

showConnections(all = TRUE)
con <- file(temp2, open = 'r')
system.time({
  for (i in 0:(num.samp - 1)) {
    new.gen[i + 1, ] <- scan(con, what = 'integer', nlines = 1)
  }
})
close(con)
# THIS TAKES 4.6 MINUTES

con <- file(temp2, open = 'r')
system.time({
  new.gen2 <- read.fwf(con, widths = rep(1, num.cols), buffersize = 100,
                       header = FALSE, colClasses = rep('integer', num.cols))
})
# THIS TAKES OVER 20 MINUTES (I GOT BORED OF WAITING AND KILLED IT)

This seems surprising to me. Can anyone see some other way to speed this type of thing up?

Matt
--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
Re: [R] How to deal with more than 6GB dataset using R?
It all depends on what you are doing with the data. First, in your scan example I would not read one line at a time, but probably several thousand, and then process the data. Most of your time is probably spent in reading. I assume that you are not reading it all in at once (but then maybe you are, since you have a 64-bit version).

It is also good to understand what read.fwf is doing: it reads in the file, parses it by columns, writes it with a separator to a temporary file, and then reads that file in with read.table to get the final result. That is one of the reasons it takes so long.

You might also consider putting the data into a database and then reading the required instances out of there. But it is hard to give specific advice since we don't know what you want to do with the data. In any case, read a good portion (several MBs at a time) to get the economy of scale, not a line at a time.

Here is an example of reading in a csv file with 666,000 lines at 1 line per scan() call, then 10, 1000 and 10000 lines at a time. Notice that at nlines=1 it takes 30 CPU seconds to process the data; at nlines=1000 it takes 2.8 (10X faster). So time the various options to see what happens.
input <- file(file, 'r')
n <- 1   # lines to read per scan() call
system.time({
  repeat {
    lines <- scan(input, what = list('', ''), sep = ',', nlines = n, quiet = TRUE)
    if (length(lines[[1]]) == 0) break
  }
})
#   user  system elapsed
#  29.52    0.08   29.90
close(input)

Re-running the same loop after reopening the connection, with larger chunks:

n <- 10     :   user  5.93   system 0.00   elapsed  5.99
n <- 1000   :   user  2.79   system 0.08   elapsed  2.90
n <- 10000  :   user  2.76   system 0.00   elapsed  2.76
Re: [R] How to deal with more than 6GB dataset using R?
You may want to look at the biglm package as another way to fit regression models on very large data sets.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111
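A sketch of the chunked fitting workflow biglm supports; the file name, chunk size, formula and column names are assumptions:

```r
library(biglm)

con <- file("big.csv", open = "r")
first <- read.csv(con, nrows = 100000)   # first chunk, with header
fit <- biglm(y ~ x1 + x2, data = first)  # hypothetical model formula
cols <- names(first)
repeat {
  chunk <- tryCatch(read.csv(con, nrows = 100000, header = FALSE,
                             col.names = cols),
                    error = function(e) NULL)  # read.csv errors at EOF
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)              # fold in the next chunk
}
close(con)
summary(fit)
```

Only one chunk is ever in memory, so the full 6 GB file never has to fit in RAM at once.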
[R] How to deal with more than 6GB dataset using R?
Hi there,

Sorry to bother those who are not interested in this problem. I'm dealing with a large data set, a file of more than 6 GB, and doing regression tests with those data. I was wondering whether there are any more efficient ways to read those data than just using read.table(). BTW, I'm using a 64-bit desktop and a 64-bit version of R, and the desktop has enough memory for me to use. Thanks.

--Gin
Re: [R] How to deal with more than 6GB dataset using R?
On 23/07/2010 12:10 PM, babyfoxlo...@sina.com wrote:
[...]

You probably won't get much faster than read.table with all of the colClasses specified. It will be a lot slower if you leave that at the default NA setting, because then R needs to figure out the types by reading the columns as character and examining all the values.

If the file is very consistently structured (e.g. the same number of characters in every value in every row) you might be able to write a C function to read it faster, but I'd guess the time spent writing that would be a lot more than the time saved.

Duncan Murdoch
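For concreteness, a minimal sketch of Duncan's suggestion; the file name and column types are assumptions (give one class per column, in order):

```r
# Hypothetical file and column layout -- replace with your own.
dat <- read.table("big.txt", header = TRUE,
                  colClasses = c("integer", "numeric", "numeric", "factor"),
                  nrows = 6e6,        # an upper-bound row count hint also helps
                  comment.char = "")  # disable comment scanning for speed
```

nrows and comment.char are optional, but both typically shave additional time off large reads.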
Re: [R] How to deal with more than 6GB dataset using R?
read.table is not very inefficient IF you specify the colClasses= parameter. scan (with the what= parameter) is probably a little more efficient. In either case, save the data using save() once you have it in the right structure, and it will be much more efficient to read next time. (In fact, I often exit R at this stage and re-start it with the .RData file before I start the analysis, to clear out the memory.)

I did a lot of testing on the types of (large) data structures I normally work with and found that

  options(save.defaults = list(compress = "bzip2", compression_level = 6, ascii = FALSE))

gave me the best trade-off between size and speed. Your mileage will undoubtedly vary, but if you do this a lot it may be worth getting hard data for your setup.

Hope this helps a little.

Allan

On 23/07/10 17:10, babyfoxlo...@sina.com wrote:
[...]
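Allan's save-once, load-fast workflow might look like this sketch; the file names and column types are assumptions:

```r
# First (slow) session: parse the text file once, then save a binary copy.
options(save.defaults = list(compress = "bzip2",
                             compression_level = 6, ascii = FALSE))
dat <- read.table("big.txt", header = TRUE,
                  colClasses = c("integer", "numeric"))  # hypothetical types
save(dat, file = "big.RData")

# Later sessions: loading the binary image is much faster than re-parsing.
load("big.RData")
```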
Re: [R] How to deal with more than 6GB dataset using R?
On 23/07/10 17:36, Duncan Murdoch wrote:
> On 23/07/2010 12:10 PM, babyfoxlo...@sina.com wrote:
> [...]
> If the file is very consistently structured (e.g. the same number of
> characters in every value in every row) you might be able to write a C
> function to read it faster, but I'd guess the time spent writing that
> would be a lot more than the time saved.

And try the utils::read.fwf() function before you roll your own C code for this use case. If you do write C code, consider writing a converter to the .RData format, which R seems to read quite efficiently.

Hope this helps.

Allan

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
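A minimal read.fwf() sketch for fixed-width data; the file name, field widths and column types are assumptions:

```r
# Hypothetical layout: three fields of 2, 5 and 5 characters per line.
dat <- read.fwf("fixed.txt", widths = c(2, 5, 5), header = FALSE,
                colClasses = c("integer", "numeric", "numeric"))
```

As noted earlier in the thread, read.fwf parses via a temporary file and read.table, so it is convenient rather than fast; specifying colClasses still helps.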