Re: [R] Re : Large database help

2006-05-17 Thread Greg Snow
Thanks for doing this, Thomas. I have been thinking about what it would
take to do this, but if it had been left to me, it would have taken a lot
longer.

Back in the '80s there was a statistical package called RUMMAGE that did
all computations based on sufficient statistics and did not keep the
actual data in memory.  Memory for computers became cheap before
datasets turned huge, so there wasn't much demand for the program (and it
never had a nice GUI to help make it popular).  It looks like things are
switching back to that model now, though.

Here are a couple of thoughts I had that may help with some future
development:

One function that could be helpful is bigplot, which I imagine would be
best based on the hexbin package, accumulating the counts in chunks like
your biglm function.  Once I see the code for biglm I may be able to
contribute this piece.  I guess bigbarplot and bigboxplot may also be
useful.  Accumulating counts for the barplot will be easy, but does
anyone have ideas on the best way to get quantiles for the boxplots
efficiently?  The best approach I can think of so far is to have the
database sort the variables, but sorting tends to be slow.
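
For concreteness, here is a minimal sketch of the chunked-binning idea
using plain rectangular bins rather than hexbin's hexagons (merging
hexbin objects across chunks would need more care); the file name, axis
ranges, and chunk size below are made-up placeholders:

    ## Sketch: accumulate 2D bin counts over chunks of a large file.
    ## Assumes the x and y ranges are known up front so the bins line
    ## up across chunks.
    nbin <- 50
    xbrk <- seq(0, 100, length.out = nbin + 1)   # assumed x range
    ybrk <- seq(0, 100, length.out = nbin + 1)   # assumed y range
    counts <- matrix(0, nbin, nbin)
    con <- file("big.txt", open = "r")           # hypothetical data file
    while (length(lines <- readLines(con, n = 10000)) > 0) {
      chunk <- read.table(text = lines, col.names = c("x", "y"))
      ix <- cut(chunk$x, xbrk, labels = FALSE)
      iy <- cut(chunk$y, ybrk, labels = FALSE)
      counts <- counts +
        as.matrix(table(factor(ix, 1:nbin), factor(iy, 1:nbin)))
    }
    close(con)
    image(xbrk, ybrk, counts)                    # density-style "bigplot"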

Another general approach I thought of would be to read the data in
chunks, compute the statistic(s) of interest on each chunk (e.g. the
vector of coefficients for a regression model), then average the
estimates across chunks.  Each chunk could be treated as a cluster in a
cluster sample for averaging and for estimating the variances of the
estimates (if only we could get the author of the survey package
involved :-).  This would probably be less accurate than your biglm
function for regression, but it would have the flavor of the
bootstrapping routines in that it would work for many cases that don't
have their own big methods written yet (logistic and other glm models,
correlations, ...).
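
To make that concrete, here is a rough, untested sketch for a logistic
regression; the chunk list and the model formula are placeholders, and
the variance comes from the simple between-chunk variability of the
estimates:

    ## Sketch: fit the model on each chunk, average the coefficients,
    ## and treat the chunks like clusters to get a variance estimate.
    fit_chunk <- function(chunk)
      coef(glm(y ~ x1 + x2, family = binomial, data = chunk))
    coefs <- sapply(chunks, fit_chunk)   # p x k matrix, k = no. of chunks
    k   <- ncol(coefs)
    est <- rowMeans(coefs)               # averaged estimate
    vcv <- var(t(coefs)) / k             # variance of the chunk mean
    cbind(estimate = est, se = sqrt(diag(vcv)))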

Any other thoughts anyone?


-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Thomas Lumley
Sent: Tuesday, May 16, 2006 3:40 PM
To: roger koenker
Cc: r-help list; Robert Citek
Subject: Re: [R] Re : Large database help

On Tue, 16 May 2006, roger koenker wrote:

 In ancient times, 1999 or so, Alvaro Novo and I experimented with an 
 interface to mysql that brought chunks of data into R and accumulated 
 results.
 This is still described and available on the web in its original form 
 at

   http://www.econ.uiuc.edu/~roger/research/rq/LM.html

 Despite claims of future developments nothing emerged, so anyone 
 considering further explorations with it may need training in 
 Rchaeology.

A few hours ago I submitted to CRAN a package, biglm, that fits large
linear regression models using a similar strategy (it uses an
incremental QR decomposition rather than accumulating the crossproduct
matrix). It also computes the Huber/White sandwich variance estimate in
the same single pass over the data.

Assuming I haven't messed up the package checking, it will appear on
CRAN in the next couple of days. The syntax looks like
   a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
   a <- update(a, chunk2)
   a <- update(a, chunk3)
   summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.


-thomas



Re: [R] Re : Large database help

2006-05-17 Thread Richard M. Heiberger
You might want to follow up by looking at the Data Squashing work that
Bill DuMouchel has done:

http://citeseer.ist.psu.edu/dumouchel99squashing.html



Re: [R] Re : Large database help

2006-05-17 Thread Rogerio Porto
Thank you all for the discussion.

I'll try to summarize the suggestions and give some partial conclusions
for the sake of completeness of this thread.

First, I had read the I/O manual but had forgotten the function read.fwf
suggested by Roger Peng; I'm sorry.  However, following the manual's
guidance, that function is not recommended for large files, so I need to
discover how to read fixed-width-format files using the scan function,
since there is no such example in that manual nor in ?scan.  At a
glance, it seems that read.fwf writes blank spaces between the columns
so that the file can then be read with a simple scan() call.
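
As a starting point, here is a rough sketch of reading a fixed-width
file in chunks with readLines() and substr(), using the column layout
from my original post (V1 in columns 1-7, V2 in 8-23); the file name and
chunk size are made up for illustration:

    ## Sketch: process a fixed-width file in chunks, never loading it all.
    con <- file("bigfile.txt", open = "r")
    n <- 0; s <- 0                       # running count and sum of V2
    while (length(lines <- readLines(con, n = 100000)) > 0) {
      V2 <- as.numeric(substr(lines, 8, 23))
      n <- n + sum(!is.na(V2))
      s <- s + sum(V2, na.rm = TRUE)
    }
    close(con)
    s / n                                # mean of V2 over the whole file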

I've also read the I/O manual, mainly chapter 4, about using relational
databases.  This suggestion came from Uwe Ligges and Justin Bem, who
advocated the use of MySQL with the RMySQL package.  I'm still
installing MySQL to try to convert my fixed-width-format file to that
database but, from the I/O manual, it seems I can only calculate five
descriptive statistics (the aggregate functions), so I couldn't
calculate medians or more advanced statistics like a cluster analysis.
Robert Citek raised this point as well, so I'm not sure that working
with MySQL will solve my problem.  RMySQL does have a dbApply function
that applies R functions to groups (chunks) of database rows.

There was also a suggestion, from Roger Peng, to subset the file.
Almost all participants in this thread noted the need for lots of RAM
even to work with just a few variables, as Prof. Brian Ripley pointed
out.

The future looks promising through a collection of *big* packages
specially designed to handle big data files in almost any hardware and
OS configuration, although time-demanding in some cases.  It seems the
first one in this collection is the biglm package by Thomas Lumley,
cited by Greg Snow.  The obvious drawback is that one has to re-write
every package that can't handle big data files or, at least, their most
memory-demanding operations.  This could be implemented by an option
like big.file=TRUE incorporated into some functions.  This point of view
is one of *scaling up* the methods.

Another promising way is to *scale down* the dataset.  Statisticians
know these techniques from non-hierarchical cluster analysis and
principal component analysis, among others (mainly sampling).  Engineers
and signal-processing people know them from data compression.  Computer
scientists work with training sets and data mining, which use methods to
scale down datasets.  An example was given by Richard M. Heiberger, who
cites a paper by William DuMouchel et al. on squashing flat files.
Maybe there could be some R functions specialized in these methods that,
using a DBMS, retrieve a significant subset of the data (records and
variables) that R can handle.

That's all, for a while!

Rogerio.



[R] Re : Large database help

2006-05-16 Thread justin bem
Try to open your db with MySQL and use RMySQL

----- Original Message -----
From: Roger D. Peng [EMAIL PROTECTED]
To: Rogerio Porto [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Sent: Tuesday, 16 May 2006, 1:55:41 AM
Subject: Re: [R] Large database help

You can read fixed-width files with read.fwf().  But my rough calculation says 
that your dataset will require 40GB of RAM.  I don't think you'll be able to 
read the entire thing into R.  Maybe look at a subset?

-roger

Rogerio Porto wrote:
 Hello all.
 
 I have a large .txt file whose variables are in fixed columns,
 i.e., variable V1 goes from columns 1 to 7, V2 from 8 to 23, etc.
 This is a 60GB file with 90 variables and 60 million observations.
 
 I'm working with a Pentium 4, 1GB RAM, Windows XP Pro.
 I tried the following code just to see if I could work with 2 variables,
 but it seems it is not possible:

 R : Copyright 2005, The R Foundation for Statistical Computing
 Version 2.2.1  (2005-12-20 r36812)
 ISBN 3-900051-07-0
 > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
 Ncells 169011  4.6     350000  9.4   350000  9.4
 Vcells  62418  0.5     786432  6.0   289957  2.3
 > memory.limit(size=4090)
 NULL
 > memory.limit()
 [1] 4288675840
 > system.time(a <- matrix(runif(1e6), nrow=1))
 [1] 0.28 0.02 2.42   NA   NA
 > gc()
           used (Mb) gc trigger (Mb) max used (Mb)
 Ncells  171344  4.6     350000  9.4   350000  9.4
 Vcells 1063212  8.2    3454398 26.4  4063230 31.0
 > rm(a)
 > ls()
 character(0)
 > system.time(a <- matrix(runif(60e6), nrow=1))
 Error: cannot allocate vector of size 468750 Kb
 Timing stopped at: 7.32 1.95 83.55 NA NA
 > memory.limit(size=5000)
 Error in memory.size(size) : ... 4GB
 
 So my questions are:
 1) (newbie) how can I read fixed-column text files like this?
 2) is there a way I can analyze (statistics like correlations, cluster
 analysis, etc.) such a large database without increasing RAM or moving
 to a 64-bit machine, but still using R and not using a sample?  How?
 
 Thanks in advance.
 
 Rogerio.
 

Re: [R] Re : Large database help

2006-05-16 Thread Robert Citek

On May 16, 2006, at 8:15 AM, justin bem wrote:

 Try to open your db with MySQL and use RMySQL

I've seen this offered up as a suggestion a few times but with little  
detail.  In my experience, even using SQL to pull in data from a  
MySQL DB, R would need to load the entire data set into RAM before  
doing some calculations.  But perhaps I'm using RMySQL incorrectly[1].

As a toy problem, let's imagine a data set (foo) with a single  
numerical field (bar) and 1 billion records (1e9).  In MySQL one  
would do the following to calculate the mean:

   select avg(bar) from foo ;

For a smaller data set I would issue a select statement and then  
fetch the entire set into a data frame before calculating the mean.   
Given such a large data set, how would one calculate the mean using R  
connected to this MySQL database?  How would one calculate the median  
using R connected to this MySQL database?
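
The pattern I have in mind (an untested sketch; the connection details
are placeholders) would pull the column through in chunks and keep
running totals, something like:

    ## Sketch: running mean of foo.bar, fetching 1e5 rows at a time.
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "test")   # placeholder connection
    res <- dbSendQuery(con, "select bar from foo")
    n <- 0; s <- 0
    while (!dbHasCompleted(res)) {
      chunk <- fetch(res, n = 100000)
      n <- n + nrow(chunk)
      s <- s + sum(chunk$bar)
    }
    dbClearResult(res)
    dbDisconnect(con)
    s / n                                        # the mean of bar

The median seems harder: it would apparently need either a
database-side sort or some one-pass approximation.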

Pointers to references appreciated.

[1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent



Re: [R] Re : Large database help

2006-05-16 Thread Prof Brian Ripley
On Tue, 16 May 2006, Robert Citek wrote:


 On May 16, 2006, at 8:15 AM, justin bem wrote:

 Try to open your db with MySQL and use RMySQL

 I've seen this offered up as a suggestion a few times but with little
 detail.  In my experience, even using SQL to pull in data from a
 MySQL DB, R would need to load the entire data set into RAM before
 doing some calculations.  But perhaps I'm using RMySQL incorrectly[1].

 As a toy problem, let's imagine a data set (foo) with a single
 numerical field (bar) and 1 billion records (1e9).  In MySQL one
 would do the following to calculate the mean:

   select avg(bar) from foo ;

 For a smaller data set I would issue a select statement and then
 fetch the entire set into a data frame before calculating the mean.
 Given such a large data set, how would one calculate the mean using R
 connected to this MySQL database?  How would one calculate the median
 using R connected to this MySQL database?

 Pointers to references appreciated.

Well, there *is* a manual about R Data Import/Export, and this does
discuss using R with DBMSs with examples.  How about reading it?

The point being made is that you can import just the columns you need, and 
indeed summaries of those columns.
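
For instance, the summary in the toy example never has to leave the
database at all; a minimal sketch with RMySQL (connection details
assumed) is simply:

    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "test")   # placeholder connection
    dbGetQuery(con, "select avg(bar), count(*) from foo")
    dbDisconnect(con)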

 [1] http://www.sourcekeg.co.uk/cran/src/contrib/Descriptions/RMySQL.html

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] Re : Large database help

2006-05-16 Thread Robert Citek

On May 16, 2006, at 11:19 AM, Prof Brian Ripley wrote:
 Well, there *is* a manual about R Data Import/Export, and this does
 discuss using R with DBMSs with examples.  How about reading it?

Thanks for the pointer:

   http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases

Unfortunately, that manual doesn't really answer my question.  My  
question is not about how to make R interact with a database, but  
rather how to make R interact with a database containing large data sets.

 The point being made is that you can import just the columns you  
 need, and indeed summaries of those columns.

That sounds great in theory.  Now I want to reduce it to practice.  
In the toy problem from the previous post, how can one compute the  
mean of a set of 1e9 numbers?  R has some difficulty generating a  
billion (1e9) number set, let alone taking the mean of that set.  To wit:

   bigset <- runif(1e9, 0, 1e9)

runs out of memory on my system.  I realize that I can do some fancy  
data shuffling and hand-waving to calculate the mean.  But I was  
wondering if R has a module that already abstracts out that magic,  
perhaps using a database.
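
One way to make the hand-waving concrete is to process the numbers in
blocks and keep running totals; a minimal sketch (block count and size
chosen arbitrarily):

    ## Sketch: mean of 1e9 uniforms without holding them all in memory.
    n <- 0; s <- 0
    for (i in 1:1000) {
      x <- runif(1e6, 0, 1e9)   # in practice, read the next block from disk/DB
      n <- n + length(x)
      s <- s + sum(x)
    }
    s / n                       # approximately 5e8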

Any pointers to more detailed reading are greatly appreciated.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent



Re: [R] Re : Large database help

2006-05-16 Thread roger koenker
In ancient times, 1999 or so, Alvaro Novo and I experimented with an
interface to mysql that brought chunks of data into R and accumulated  
results.
This is still described and available on the web in its original form at

http://www.econ.uiuc.edu/~roger/research/rq/LM.html

Despite claims of future developments nothing emerged, so anyone
considering further explorations with it may need training in  
Rchaeology.

The toy problem we were solving was a large least squares problem,
which was a stalking horse for large quantile regression problems.
Around the same time I discovered sparse linear algebra and realized
that virtually all the large problems I was interested in were better
handled from that perspective.

url:   www.econ.uiuc.edu/~roger        Roger Koenker
email: [EMAIL PROTECTED]               Department of Economics
vox:   217-333-4558                    University of Illinois
fax:   217-244-6678                    Champaign, IL 61820


On May 16, 2006, at 3:57 PM, Robert Citek wrote:


 On May 16, 2006, at 11:19 AM, Prof Brian Ripley wrote:
 Well, there *is* a manual about R Data Import/Export, and this does
 discuss using R with DBMSs with examples.  How about reading it?

 Thanks for the pointer:

 http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases

 Unfortunately, that manual doesn't really answer my question.  My
 question is not about how to make R interact with a database, but
 rather how to make R interact with a database containing large data sets.

 The point being made is that you can import just the columns you
 need, and indeed summaries of those columns.

 That sounds great in theory.  Now I want to reduce it to practice.
 In the toy problem from the previous post, how can one compute the
 mean of a set of 1e9 numbers?  R has some difficulty generating a
 billion (1e9) number set, let alone taking the mean of that set.  To wit:

    bigset <- runif(1e9, 0, 1e9)

 runs out of memory on my system.  I realize that I can do some fancy
 data shuffling and hand-waving to calculate the mean.  But I was
 wondering if R has a module that already abstracts out that magic,
 perhaps using a database.

 Any pointers to more detailed reading are greatly appreciated.

 Regards,
 - Robert
 http://www.cwelug.org/downloads
 Help others get OpenSource software.  Distribute FLOSS
 for Windows, Linux, *BSD, and MacOS X with BitTorrent



Re: [R] Re : Large database help

2006-05-16 Thread Thomas Lumley
On Tue, 16 May 2006, roger koenker wrote:

 In ancient times, 1999 or so, Alvaro Novo and I experimented with an
 interface to mysql that brought chunks of data into R and accumulated
 results.
 This is still described and available on the web in its original form at

   http://www.econ.uiuc.edu/~roger/research/rq/LM.html

 Despite claims of future developments nothing emerged, so anyone
 considering further explorations with it may need training in
 Rchaeology.

A few hours ago I submitted to CRAN a package, biglm, that fits large 
linear regression models using a similar strategy (it uses an incremental 
QR decomposition rather than accumulating the crossproduct matrix). It also 
computes the Huber/White sandwich variance estimate in the same single 
pass over the data.

Assuming I haven't messed up the package checking, it will appear 
on CRAN in the next couple of days. The syntax looks like
   a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
   a <- update(a, chunk2)
   a <- update(a, chunk3)
   summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.
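
For contrast, here is a bare-bones sketch of the cruder crossproduct
strategy mentioned above (accumulating X'X and X'y over the chunks and
solving the normal equations; numerically less stable than the
incremental QR, and the chunk list is just illustrative):

    ## Sketch: one-pass least squares via the normal equations.
    form <- log(Volume) ~ log(Girth) + log(Height)
    xtx <- NULL
    xty <- NULL
    for (chunk in list(chunk1, chunk2, chunk3)) {
      X <- model.matrix(form, chunk)
      y <- log(chunk$Volume)
      if (is.null(xtx)) {
        xtx <- matrix(0, ncol(X), ncol(X))
        xty <- numeric(ncol(X))
      }
      xtx <- xtx + crossprod(X)      # accumulate X'X
      xty <- xty + crossprod(X, y)   # accumulate X'y
    }
    beta <- solve(xtx, xty)          # matches lm() on the full data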


-thomas
