Re: [R] Re : Large database help
Thanks for doing this Thomas. I have been thinking about what it would take to do this, but if it were left to me, it would have taken a lot longer. Back in the 80's there was a statistical package called RUMMAGE that did all computations based on sufficient statistics and did not keep the actual data in memory. Memory for computers became cheap before datasets turned huge, so there wasn't much demand for the program (and it never had a nice GUI to help make it popular). It looks like things are switching back to that model now though.

Here are a couple of thoughts I had that maybe could help with some future development:

Another function that could be helpful is bigplot, which I imagine would best be based on the hexbin package, just accumulating the counts in chunks like your biglm function. Once I see the code for biglm I may be able to contribute this piece. I guess bigbarplot and bigboxplot may also be useful (accumulating counts for the barplot will be easy, but does anyone have ideas on the best way to get quantiles for the boxplots efficiently? The best approach I can think of so far is to have the database sort the variables, but sorting tends to be slow).

Another general approach that I thought of would be to read the data in chunks, compute the statistic(s) of interest on each chunk (e.g. the vector of coefficients for a regression model), then average the estimates across chunks (a rough sketch is appended at the end of this message). Each chunk could be treated as a cluster in a cluster sample for the averaging and for estimating variances of the estimates (if only we can get the author of the survey package involved :-). This would probably be less accurate than your biglm function for regression, but it would have the flavor of the bootstrapping routines in that it would work for many cases that don't have their own big methods written yet (logistic and other glm models, correlations, ...).

Any other thoughts anyone?

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Thomas Lumley
Sent: Tuesday, May 16, 2006 3:40 PM
To: roger koenker
Cc: r-help list; Robert Citek
Subject: Re: [R] Re : Large database help

On Tue, 16 May 2006, roger koenker wrote:

> In ancient times, 1999 or so, Alvaro Novo and I experimented with an interface to mysql that brought chunks of data into R and accumulated results. This is still described and available on the web in its original form at http://www.econ.uiuc.edu/~roger/research/rq/LM.html
>
> Despite claims of future developments nothing emerged, so anyone considering further explorations with it may need training in Rchaeology.

A few hours ago I submitted to CRAN a package, biglm, that fits large linear regression models using a similar strategy (it uses an incremental QR decomposition rather than accumulating the crossproduct matrix). It also computes the Huber/White sandwich variance estimate in the same single pass over the data. Assuming I haven't messed up the package checking, it will appear on CRAN in the next couple of days.

The syntax looks like

  a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
  a <- update(a, chunk2)
  a <- update(a, chunk3)
  summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.

     -thomas
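In code, the chunk-averaging idea might look something like the following untested sketch; the model formula, the variable names, and the list 'chunks' of already-read data frames are all made up for illustration:

  ## 'chunks' is assumed to be a list of data frames, one per chunk of the file
  fits <- lapply(chunks, function(d)
      coef(glm(y ~ x1 + x2, family = binomial, data = d)))
  B   <- do.call(rbind, fits)              # one row of coefficients per chunk
  est <- colMeans(B)                       # averaged estimates
  se  <- apply(B, 2, sd) / sqrt(nrow(B))   # chunks treated as clusters/replicates

The standard errors here use only the between-chunk variability, which is the cluster-sample flavor described above; how well that works compared with biglm on the same data would need checking.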
Re: [R] Re : Large database help
You might want to follow up by looking at the data squashing work that Bill DuMouchel has done: http://citeseer.ist.psu.edu/dumouchel99squashing.html
Re: [R] Re : Large database help
Thank you all for the discussion. I'll try to summarize the suggestions and give some partial conclusions, for the sake of completeness of this thread.

First, I had read the R Data Import/Export manual but had forgotten the function read.fwf, as suggested by Roger Peng. I'm sorry. However, following the manual's guidance, this function is not recommended for large files, and I still need to work out how to read fixed-width-format files with the scan function, since there is no such example either in that manual or in ?scan. At a glance, it seems read.fwf inserts separators at the column positions and then reads the result with a simple scan() call.

I've also read the I/O manual, mainly chapter 4 on relational databases. This suggestion came from Uwe Ligges and Justin Bem, who advocated using MySQL with the RMySQL package. I'm still installing MySQL to try converting my fixed-width-format file to that database but, from the I/O manual, it seems I can only calculate five descriptive statistics (the SQL aggregate functions). So I couldn't calculate medians or more advanced statistics such as a cluster analysis. This point was raised by Robert Citek, so I'm not sure that working with MySQL will solve my problem. RMySQL does have the dbApply function, which applies R functions to groups (chunks) of database rows. There was also a suggestion from Roger Peng to subset the file. Almost all participants in this thread noted the need for lots of RAM even to work with just a few variables, as Prof. Brian Ripley pointed out.

The future looks promising through a collection of *big* packages specially designed to handle big data files on almost any hardware and OS configuration, although time-consuming in some cases. It seems the first one in this collection is the biglm package by Thomas Lumley, cited by Greg Snow. The obvious drawback is that one has to rewrite every package that can't handle big data files or, at least, their most memory-demanding operations. Perhaps this could be offered as an option like big.file=TRUE in some functions. This point of view is one of *scaling up* the methods.

Another promising way is to *scale down* the dataset. Statisticians know these techniques from non-hierarchical cluster analysis and principal component analysis, among others (mainly sampling). Engineers and signal-processing people know them from data compression. Computer scientists work with training sets and data mining, which use methods to scale down datasets. An example was given by Richard M. Heiberger, who cites a paper by William DuMouchel et al. on squashing flat files. Maybe there could be some R functions specialized in these methods that, using a DBMS, retrieve a significant subset of the data (records and variables) that R can handle.

That's all for a while!

Rogerio.
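For what it's worth, a chunked read that avoids read.fwf's temporary file can be pieced together from readLines() and substr(); the file name, chunk size, and the "accumulate" step below are only placeholders, while the column positions are the ones for V1 and V2 described in the original post:

  con <- file("bigfile.txt", open = "r")   # hypothetical file name
  chunk.size <- 100000                     # lines per chunk
  repeat {
      lines <- readLines(con, n = chunk.size)
      if (length(lines) == 0) break
      chunk <- data.frame(V1 = as.numeric(substr(lines, 1, 7)),    # columns 1-7
                          V2 = as.numeric(substr(lines, 8, 23)))   # columns 8-23
      ## ... accumulate whatever statistics are needed from 'chunk' here ...
  }
  close(con)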
[R] Re : Large database help
Try to open your db with MySQL and use RMySQL.

----- Original Message -----
From: Roger D. Peng [EMAIL PROTECTED]
To: Rogerio Porto [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Sent: Tuesday, 16 May 2006, 1:55:41
Subject: Re: [R] Large database help

You can read fixed-width files with read.fwf(). But my rough calculation says that your dataset will require 40GB of RAM. I don't think you'll be able to read the entire thing into R. Maybe look at a subset? -roger

Rogerio Porto wrote:

Hello all. I have a large .txt file whose variables are in fixed columns, i.e., variable V1 goes from columns 1 to 7, V2 from 8 to 23, etc. This is a 60GB file with 90 variables and 60 million observations. I'm working with a Pentium 4, 1GB RAM, Windows XP Pro. I tried the following code just to see if I could work with 2 variables, but it seems not possible:

  R : Copyright 2005, The R Foundation for Statistical Computing
  Version 2.2.1 (2005-12-20 r36812)  ISBN 3-900051-07-0

  > gc()
           used (Mb) gc trigger (Mb) max used (Mb)
  Ncells 169011  4.6         35  9.4       35  9.4
  Vcells  62418  0.5     786432  6.0   289957  2.3
  > memory.limit(size=4090)
  NULL
  > memory.limit()
  [1] 4288675840
  > system.time(a <- matrix(runif(1e6), nrow=1))
  [1] 0.28 0.02 2.42 NA NA
  > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
  Ncells  171344  4.6         35  9.4       35  9.4
  Vcells 1063212  8.2    3454398 26.4  4063230 31.0
  > rm(a)
  > ls()
  character(0)
  > system.time(a <- matrix(runif(60e6), nrow=1))
  Error: cannot allocate vector of size 468750 Kb
  Timing stopped at: 7.32 1.95 83.55 NA NA
  > memory.limit(size=5000)
  Error in memory.size(size) : .4GB

So my questions are:
1) (newbie) how can I read fixed-column text files like this?
2) is there a way I can analyze (statistics like correlations, clustering, etc.) such a large database without increasing RAM and without changing to a 64-bit machine, but still using R and not using a sample? How?

Thanks in advance.

Rogerio.
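Roger's 40GB figure presumably comes from 60e6 observations x 90 variables x 8 bytes per double, about 43e9 bytes, so reading everything as numeric is out of the question on 1GB of RAM. A small feasibility check might read just the first two variables for a limited number of rows with read.fwf(); the widths follow the column layout above (V1 in columns 1 to 7, V2 in columns 8 to 23), while the file name and row count are hypothetical, and even this will be slow on a 60GB file:

  ## read only the first 100,000 records of V1 and V2 as a trial
  dat <- read.fwf("bigfile.txt", widths = c(7, 16),
                  col.names = c("V1", "V2"), n = 100000)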
Re: [R] Re : Large database help
On May 16, 2006, at 8:15 AM, justin bem wrote:

> Try to open your db with MySQL and use RMySQL

I've seen this offered up as a suggestion a few times, but with little detail. In my experience, even using SQL to pull in data from a MySQL DB, R would need to load the entire data set into RAM before doing some calculations. But perhaps I'm using RMySQL incorrectly [1].

As a toy problem, let's imagine a data set (foo) with a single numerical field (bar) and 1 billion records (1e9). In MySQL one would do the following to calculate the mean:

  select avg(bar) from foo ;

For a smaller data set I would issue a select statement and then fetch the entire set into a data frame before calculating the mean. Given such a large data set, how would one calculate the mean using R connected to this MySQL database? How would one calculate the median?

Pointers to references appreciated.

[1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS for Windows, Linux, *BSD, and MacOS X with BitTorrent
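One way the in-database route can look from R, assuming a local MySQL database "test" already holds the table foo(bar); the connection details are made up, but the point is that only the one-number answer crosses over into R:

  library(RMySQL)
  con <- dbConnect(MySQL(), dbname = "test")     # hypothetical connection
  ## the server does the averaging; R receives a single number
  dbGetQuery(con, "SELECT AVG(bar) AS mean_bar FROM foo")
  ## for the median, one (slow) option is to let the server sort and
  ## return the middle record: for 1e9 rows, roughly row 500,000,000
  dbGetQuery(con, "SELECT bar FROM foo ORDER BY bar LIMIT 1 OFFSET 499999999")
  dbDisconnect(con)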
Re: [R] Re : Large database help
On Tue, 16 May 2006, Robert Citek wrote:

> On May 16, 2006, at 8:15 AM, justin bem wrote:
> > Try to open your db with MySQL and use RMySQL
>
> I've seen this offered up as a suggestion a few times, but with little detail. In my experience, even using SQL to pull in data from a MySQL DB, R would need to load the entire data set into RAM before doing some calculations. But perhaps I'm using RMySQL incorrectly [1].
>
> As a toy problem, let's imagine a data set (foo) with a single numerical field (bar) and 1 billion records (1e9). In MySQL one would do the following to calculate the mean:
>
>   select avg(bar) from foo ;
>
> For a smaller data set I would issue a select statement and then fetch the entire set into a data frame before calculating the mean. Given such a large data set, how would one calculate the mean using R connected to this MySQL database? How would one calculate the median?
>
> Pointers to references appreciated.

Well, there *is* a manual about R Data Import/Export, and this does discuss using R with DBMSs, with examples. How about reading it?

The point being made is that you can import just the columns you need, and indeed summaries of those columns.

> [1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [R] Re : Large database help
On May 16, 2006, at 11:19 AM, Prof Brian Ripley wrote:

> Well, there *is* a manual about R Data Import/Export, and this does discuss using R with DBMSs, with examples. How about reading it?

Thanks for the pointer:

http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases

Unfortunately, that manual doesn't really answer my question. My question is not about how to make R interact with a database, but rather how to make R interact with a database containing large data sets.

> The point being made is that you can import just the columns you need, and indeed summaries of those columns.

That sounds great in theory. Now I want to reduce it to practice. In the toy problem from the previous post, how can one compute the mean of a set of 1e9 numbers? R has some difficulty generating a billion (1e9) numbers, let alone taking the mean of that set. To wit:

  bigset <- runif(1e9, 0, 1e9)

runs out of memory on my system. I realize that I can do some fancy data shuffling and hand-waving to calculate the mean. But I was wondering if R has a module that already abstracts out that magic, perhaps using a database. Any pointers to more detailed reading are greatly appreciated.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS for Windows, Linux, *BSD, and MacOS X with BitTorrent
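A sketch of that "fancy data shuffling" done explicitly: stream the column through R in manageable pieces and keep only a running sum and count. The table and column names follow the toy problem above, and the connection details are invented:

  library(RMySQL)
  con <- dbConnect(MySQL(), dbname = "test")     # hypothetical connection
  res <- dbSendQuery(con, "SELECT bar FROM foo")
  total <- 0; n <- 0
  while (!dbHasCompleted(res)) {
      chunk <- fetch(res, n = 1e6)               # a million rows at a time
      total <- total + sum(chunk$bar)
      n     <- n + nrow(chunk)
  }
  dbClearResult(res)
  dbDisconnect(con)
  total / n                                      # the mean, without 1e9 values in RAM

A median is harder this way, since it needs either a server-side sort (as in the earlier SELECT ... ORDER BY example) or a multi-pass counting scheme.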
Re: [R] Re : Large database help
In ancient times, 1999 or so, Alvaro Novo and I experimented with an interface to mysql that brought chunks of data into R and accumulated results. This is still described and available on the web in its original form at

http://www.econ.uiuc.edu/~roger/research/rq/LM.html

Despite claims of future developments nothing emerged, so anyone considering further explorations with it may need training in Rchaeology.

The toy problem we were solving was a large least squares problem, which was a stalking horse for large quantile regression problems. Around the same time I discovered sparse linear algebra and realized that virtually all large problems I was interested in were better handled from that perspective.

url: www.econ.uiuc.edu/~roger    Roger Koenker
email: [EMAIL PROTECTED]         Department of Economics
vox: 217-333-4558                University of Illinois
fax: 217-244-6678                Champaign, IL 61820

On May 16, 2006, at 3:57 PM, Robert Citek wrote:

> On May 16, 2006, at 11:19 AM, Prof Brian Ripley wrote:
> > Well, there *is* a manual about R Data Import/Export, and this does discuss using R with DBMSs, with examples. How about reading it?
>
> Thanks for the pointer: http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases
>
> Unfortunately, that manual doesn't really answer my question. My question is not about how to make R interact with a database, but rather how to make R interact with a database containing large data sets.
>
> > The point being made is that you can import just the columns you need, and indeed summaries of those columns.
>
> That sounds great in theory. Now I want to reduce it to practice. In the toy problem from the previous post, how can one compute the mean of a set of 1e9 numbers? R has some difficulty generating a billion (1e9) numbers, let alone taking the mean of that set. To wit:
>
>   bigset <- runif(1e9, 0, 1e9)
>
> runs out of memory on my system. I realize that I can do some fancy data shuffling and hand-waving to calculate the mean. But I was wondering if R has a module that already abstracts out that magic, perhaps using a database. Any pointers to more detailed reading are greatly appreciated.
>
> Regards,
> - Robert
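This is not from the LM.html write-up; it is just a small, generic illustration of the sparse-algebra route, using the Matrix package and a made-up sparse design matrix:

  library(Matrix)
  set.seed(1)
  n <- 1e5; p <- 200
  X <- rsparsematrix(n, p, density = 0.01)    # mostly zeros, stored sparsely
  beta <- rnorm(p)
  y <- as.vector(X %*% beta) + rnorm(n)
  ## least squares via the normal equations; crossprod() keeps things sparse
  betahat <- solve(crossprod(X), crossprod(X, y))

When the design matrix really is sparse (factors, dummies, splines), the storage and the Cholesky factorization both stay far smaller than the dense equivalents.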
Re: [R] Re : Large database help
On Tue, 16 May 2006, roger koenker wrote:

> In ancient times, 1999 or so, Alvaro Novo and I experimented with an interface to mysql that brought chunks of data into R and accumulated results. This is still described and available on the web in its original form at http://www.econ.uiuc.edu/~roger/research/rq/LM.html
>
> Despite claims of future developments nothing emerged, so anyone considering further explorations with it may need training in Rchaeology.

A few hours ago I submitted to CRAN a package, biglm, that fits large linear regression models using a similar strategy (it uses an incremental QR decomposition rather than accumulating the crossproduct matrix). It also computes the Huber/White sandwich variance estimate in the same single pass over the data. Assuming I haven't messed up the package checking, it will appear on CRAN in the next couple of days.

The syntax looks like

  a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
  a <- update(a, chunk2)
  a <- update(a, chunk3)
  summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.

     -thomas
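For anyone who wants to try it, the syntax above runs as-is on the built-in 'trees' data (the source of the Volume/Girth/Height formula) once it is split into chunks; only the artificial chunking of 'trees' below is invented:

  library(biglm)
  chunk1 <- trees[1:10, ]
  chunk2 <- trees[11:20, ]
  chunk3 <- trees[21:31, ]

  a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
  a <- update(a, chunk2)
  a <- update(a, chunk3)
  summary(a)

  ## for comparison, the usual single fit on all 31 rows:
  coef(lm(log(Volume) ~ log(Girth) + log(Height), data = trees))

The chunked coefficients should agree with the all-at-once fit, since the incremental QR sees exactly the same data.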