Re: [R] Memory Problems with CSV and Survey Objects

2009-10-25 Thread tlumley

On Sat, 24 Oct 2009, Carlos J. Gil Bellosta wrote:


Hello,

Adding to Thomas' email, you could also use the colbycol package, which
lets you load into R files that a plain read.table cannot cope with,
examine the columns independently, select the ones you are most
interested in, and finally build a data frame containing just those
columns.

It is the same strategy Thomas suggested, only without the requirement
of an external tool and with almost the same syntax you would use if
you had no memory problems.


I'm not sure that this has any less requirement for an external tool.  Both 
approaches require downloading an R package from CRAN. RSQLite requires SQLite, 
but that is included in the package. colbycol requires Java (via rJava), which 
isn't included in the package, but is already present on many machines.

 -thomas






Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle



Re: [R] Memory Problems with CSV and Survey Objects

2009-10-25 Thread Gabor Grothendieck
Note that read.csv.sql in the sqldf package could be used to avoid
most of the setup:

library(sqldf)
DF <- read.csv.sql("myfile.csv", sql = "select ...")

It will set up the database, read the file into it, apply the select
statement, place the result into the data frame DF, and destroy the
database, all in one line.  See Example 13 on the home page:
http://sqldf.googlecode.com
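Applied to the file from the original post, the call might look like
the sketch below; mainwgt comes from that post, while repwgt1 and
repwgt2 are hypothetical stand-ins for the actual replicate-weight
columns:

library(sqldf)

# read.csv.sql refers to the incoming file as "file" in the SQL
# statement; only the selected columns ever reach R, and the temporary
# SQLite database is destroyed when the call returns
data08 <- read.csv.sql("data08.csv",
    sql = "select mainwgt, repwgt1, repwgt2 from file")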



Re: [R] Memory Problems with CSV and Survey Objects

2009-10-24 Thread Carlos J. Gil Bellosta
Hello,

Adding to Thomas' email, you could also use the colbycol package, which
lets you load into R files that a plain read.table cannot cope with,
examine the columns independently, select the ones you are most
interested in, and finally build a data frame containing just those
columns.

It is the same strategy Thomas suggested, only without the requirement
of an external tool and with almost the same syntax you would use if
you had no memory problems.
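
A minimal sketch of that workflow (from memory of colbycol's
interface, so treat the function names as assumptions; the repwgt1
column name is just an example):

library(colbycol)

# parse the CSV column by column, never holding the whole file in RAM
cbc.data <- cbc.read.table("data08.csv", header = TRUE, sep = ",")

# rebuild a data frame from just the columns you need
data08 <- data.frame(
    mainwgt = cbc.get.col(cbc.data, "mainwgt"),
    repwgt1 = cbc.get.col(cbc.data, "repwgt1")
)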

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com





[R] Memory Problems with CSV and Survey Objects

2009-10-23 Thread Anthony Damico
I'm working with a 350MB CSV file on a server that has 3GB of RAM, yet I'm
hitting a memory error when I try to store the data frame into a survey
design object, the R object that stores data for complex sample survey data.

When I launch R, I execute the following line from Windows:
C:\Program Files\R\R-2.9.1\bin\Rgui.exe --max-mem-size=2047M
Anything higher, and I get an error message saying the maximum has been set
to 2047M.

Here are the commands:

library(survey)

# this step takes more than five minutes
data08 <- read.csv("data08.csv", header = TRUE, nrows = 210437)

object.size(data08)
# 329877112 bytes

# Looking at Windows Task Manager, Mem Usage for Rgui.exe is already 659,632K

brr.dsgn <- svrepdesign(
    data = data08,
    repweights = data08[, grep("^repwgt", colnames(data08))],
    type = "BRR",
    combined.weights = TRUE,
    weights = data08$mainwgt
)
# Error: cannot allocate vector of size 254.5 Mb

# The survey design object does not get created.

# This also causes Windows Task Manager, Mem Usage to spike to 1,748,136K

# And here are some memory diagnostics

memory.limit()
[1] 2047

memory.size()
[1] 1449.06

gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    131148   3.6     593642   15.9  15680924  418.8
Vcells  45479988 347.0  173526492 1324.0 220358611 1681.3

A description of the survey package can be found here:
http://faculty.washington.edu/tlumley/survey/

I tried creating a work-around using the database-backed survey objects
(DB SOs) included in the survey package, which conserve memory on larger
datasets like this one.  Unfortunately, I don't think the survey package
supports database connections for replicate-weight designs yet: I've
only been able to get a database connection working when creating a
svydesign object, not a svrepdesign object, and neither the DB SO
website nor the svrepdesign help page makes any mention of those
parameters.

The DB SOs are described in detail here:
http://faculty.washington.edu/tlumley/survey/svy-dbi.html
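
The kind of call I have been able to get working is a database-backed
svydesign, roughly like the sketch below (the id specification and the
table, weight, and file names are simplified placeholders):

library(survey)

# database-backed design: 'data' names a table inside the SQLite file,
# and variables are fetched from the database only when an analysis
# needs them
db.dsgn <- svydesign(
    id = ~1,
    weights = ~mainwgt,
    data = "data08",         # table name, not a data frame
    dbtype = "SQLite",
    dbname = "data08.db"     # placeholder database file
)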

Any advice would be truly appreciated.

Thanks,
 Anthony Damico




Re: [R] Memory Problems with CSV and Survey Objects

2009-10-23 Thread tlumley



Yes, a 350Mb data frame is a bit big for 32-bit R to handle conveniently.

As you note, the survey package doesn't yet do database-backed replicate-weight 
designs. You can get the same effect yourself without too much work.

First, put the data into a database, such as SQLite.  If you have the data
frame read in already, dbWriteTable() will do it.

Now, drop most of the variables, keeping the sampling weights, replicate 
weights, and a couple of other variables.

Create a svrepdesign() with the reduced data set.

When you want to do an analysis, use dbGetQuery() to load the variables you 
need for the analysis, and put them in the $variables component of the 
svrepdesign.

That's exactly what the database-backed functions do for svydesign objects.

[If you only ever want to use a small subset of the variables, it's even 
easier: drop all the extraneous variables and create a svrepdesign with the 
variables you want]
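
In code, those steps look roughly like this sketch, assuming RSQLite
and the column names from the original post (somevar is a placeholder
for whatever analysis variable you need):

library(survey)
library(RSQLite)

# 1. put the full data frame into an SQLite database
con <- dbConnect(SQLite(), dbname = "data08.db")
dbWriteTable(con, "data08", data08)

# 2. keep only the weight columns (plus any always-used variables)
keep <- c("mainwgt", grep("^repwgt", colnames(data08), value = TRUE))
small <- data08[, keep]
rm(data08); gc()

# 3. create the replicate-weight design from the reduced data frame
brr.dsgn <- svrepdesign(
    data = small,
    repweights = small[, grep("^repwgt", colnames(small))],
    type = "BRR",
    combined.weights = TRUE,
    weights = small$mainwgt
)

# 4. for each analysis, pull just the variables it needs from the
#    database and drop them into the design's $variables component
brr.dsgn$variables <- dbGetQuery(con, "select somevar from data08")
svymean(~somevar, brr.dsgn)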

   -thomas




Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.