Re: [R] Can we do GLM on 2GB data set with R?

2007-01-21 Thread Prof Brian Ripley
'given sufficient hardware' and a suitable OS, 'yes'.

You will see quoted on this list from time to time:

 library(fortunes)
 fortune("Yoda")

Evelyn Hall: I would like to know how (if) I can extract some of the
information from the summary of my nlme.
Simon Blomberg: This is R. There is no if. Only how.
-- Evelyn Hall and Simon `Yoda' Blomberg
   R-help (April 2005)

You then mention 'my PC'.  If your 'PC' is running Windows, the answer is 
'with some work', since we don't have a version of R for Win64 and Win32 
is limited to 3GB user address space.

To be more precise, we would have to know more about your GLM (and even 
whether you mean GLM in the commonly accepted sense or the SASism with a 
redundant G), including what the variables are (I guess categorical 
variables stored as small integers?).
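
As a rough, illustrative calculation (the counts below are hypothetical, 
not taken from your description): glm() expands the data into a 
double-precision model matrix, so even tinyint columns cost 8 bytes per 
cell once they get there, and several working copies of objects of that 
size are needed during the fit.

 n <- 10e6            # cases (hypothetical)
 p <- 30              # model-matrix columns after expanding factors (hypothetical)
 n * p * 8 / 2^30     # about 2.2 GB for a single copy of the model matrix
 ## Peak memory use during glm() is a multiple of that figure.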

My DPhil student Fei Chen looked at ways of applying R to large GLMs with 
data stored in a MySQL database.  His tests were about 4 years ago on 
32-bit Linux, and he was able to run about 1 million cases on 30 
categorical (mainly binary) variables with (I think) up to 5-way 
interactions.  That is a very large GLM problem, and it is unusual for
it to be worth fitting a (mainly linear) model with over 10,000 cases.
(Also, there are normally problems with the homogeneity of very large 
datasets that taint the independence assumptions made by GLMs.)

My guess is that you have been considering the function glm().  There is 
a function bigglm() in package biglm (by Thomas Lumley).  I don't think you 
would be able even to load your data into 32-bit R, but it would be 
possible to use the ideas behind bigglm (which was one of the approaches 
Fei assessed) and perhaps even bigglm itself with one of the DBMS 
interfaces to R to retrieve data in chunks.  (bigglm uses chunks of rows, 
but chunks of columns may be more efficient.)
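
For instance, a minimal sketch of that chunked approach (the database 
file, table, variable names and family below are hypothetical, purely 
for illustration): bigglm() will accept, in place of a data frame, a 
function of one argument 'reset' that returns successive chunks, and a 
DBI back end can supply those chunks.

 ## Sketch only: database, table and variable names are made up.
 library(DBI)
 library(RSQLite)     # any DBI back end would do
 library(biglm)

 con <- dbConnect(SQLite(), "claims.sqlite")

 ## Called with reset = TRUE the function restarts the data; with
 ## reset = FALSE it returns the next chunk (a data frame), or NULL
 ## once the data are exhausted.
 make.chunker <- function(con, sql, chunksize = 10000) {
   res <- NULL
   function(reset = FALSE) {
     if (reset) {
       if (!is.null(res)) dbClearResult(res)
       res <<- dbSendQuery(con, sql)
       return(NULL)
     }
     chunk <- fetch(res, n = chunksize)
     if (nrow(chunk) == 0) NULL else chunk
   }
 }

 chunker <- make.chunker(con, "SELECT y, w, x1, x2, x3 FROM policies")
 fit <- bigglm(y ~ x1 + x2 + x3, data = chunker,
               family = binomial(), weights = ~ w)
 summary(fit)
 dbDisconnect(con)

Re-running the query at each reset keeps memory flat at the cost of 
repeated scans, since bigglm() makes several passes over the data.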

Another possibility is that you want to fit a log-linear model to purely 
categorical data, and could make use of loglin().  That will be more 
efficient if the contingency table is densely populated.
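
A minimal illustration with a built-in contingency table (not your 
data): fitting the model with all two-way interactions to the 4 x 4 x 2 
table HairEyeColor.

 data(HairEyeColor)
 fit <- loglin(HairEyeColor, margin = list(c(1, 2), c(1, 3), c(2, 3)))
 fit$lrt    # likelihood-ratio statistic against the saturated model
 fit$df     # residual degrees of freedom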

My experience suggests that the important issues here are likely to be 
statistical rather than computational, and this is more a topic for a 
consultant than volunteer help on a discussion list.


On Sat, 20 Jan 2007, WILLIE, JILL wrote:

 We are wanting to use R instead of/in addition to our existing stats
 package because of its huge assortment of stat functions.  But, we
 routinely need to fit GLM models to files that are approximately 2-4GB
 (as SQL tables, un-indexed, w/tinyint-sized fields except for the
 response & weight variables).  Is this feasible, does anybody know,
 given sufficient hardware, using R?  It appears to use a great deal of
 memory on the small files I've tested.

 I've read the data import, memory.limit, memory.size & general
 documentation but can't seem to find a way to tell what the boundaries
 are & roughly gauge the needed memory...other than trial & error.  I've
 started by testing the data.frame & run out of memory on my PC.  I'm new
 to R so please be forgiving if this is a poorly-worded question.

 Jill Willie
 Open Seas
 Safeco Insurance
 [EMAIL PROTECTED]
 206-545-5673


-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

