Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-09 Thread François Pinard
[hadley wickham]

[François Pinard]

 Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
 easily selecting a _random_ sample of the records of a given table.
 On the other hand, I'm no SQL specialist, others might know better.

There are a number of such devices, which tend to be rather SQL variant
specific.  Try googling for "select random rows mysql", "select random
rows pgsql", etc.

Thanks as well for these hints.  Googling around as you suggested (yet 
keeping my eyes in the MySQL direction, because this is what we use), 
getting MySQL itself to do the selection looks a bit discouraging: 
according to the comments I've read, MySQL does not seem to scale well 
with the database size, especially when records have to be decorated 
with random numbers and later sorted.

Yet, I did not run any benchmarks myself, and would not blindly take 
everything I read for granted, given that MySQL developers have speed in 
mind, and there are ways to interrupt a sort before it runs to full 
completion, when only a few sorted records are wanted.
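For concreteness, the decorate-and-sort idiom being discussed is usually 
written as ORDER BY RAND(); a minimal sketch, driven from R through RMySQL 
(the connection details and the table name big_table are placeholders, not 
anything from this thread):

library(DBI)
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "mydb")   # placeholder connection
## MySQL decorates every row with RAND() and sorts, hence the scaling worry:
smp <- dbGetQuery(con, "SELECT * FROM big_table ORDER BY RAND() LIMIT 1000")
dbDisconnect(con)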

Another possibility is to generate a large table of randomly
distributed ids and then use that (with randomly generated limits) to
select the appropriate number of records.

I'm not sure I understand your idea (what confuses me is the randomly 
generated limits part).  If the large table is much larger than the 
size of the wanted sample, we might not be gaining much.

Just for fun: here, sample(1, 10) in R is slowish already :-).

All in all, if I ever have such a problem, a practical solution probably 
has to be outside of R, and maybe outside SQL as well.

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-09 Thread r . ghezzo
I found "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))" by
Kim-Hung Li, ACM Transactions on Mathematical Software, Vol. 20, No. 4, Dec. 1994,
pp. 481-492.
He mentions algorithms Z and K and proposes two improved versions, algorithms L and M.
Algorithm L is really easy to implement but relatively slow; M doesn't look very
difficult and is the fastest.
Heberto Ghezzo
McGill University
Montreal - Canada

Quoting François Pinard [EMAIL PROTECTED]:

 [Martin Maechler]

 FrPi Suppose the file (or tape) holds N records (N is not known
 FrPi in advance), from which we want a sample of M records at
 FrPi most. [...] If the algorithm is carefully designed, when
 FrPi the last (N'th) record of the file will have been processed
 FrPi this way, we may then have M records randomly selected from
 FrPi N records, in such a way that each of the N records had an
 FrPi equal probability to end up in the selection of M records.  I
 FrPi may seek out for details if needed.

 [...] I'm also intrigued about the details of the algorithm you
 outline above.

 I went into my old SPSS books and related references to find it for you,
 to no avail (yet I confess I did not try very hard).  I vaguely remember
 it was related to Spearman's correlation computation: I did find notes
 about the severe memory limitation of this computation, but nothing
 about the implemented workaround.  I did find other sampling devices,
 but not the very one I remember having read about, many years ago.

 On the other hand, Googling tells that this topic has been much studied,
 and that Vitter's algorithm Z seems to be popular nowadays (even if not
 the simplest) because it is more efficient than others.  Google found
 a copy of the paper:

http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf

 Here is an implementation for Postgres:

http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php

 yet I do not find it very readable -- but this is only an opinion: I'm
 rather demanding in the area of legibility, while many or most people
 are more courageous than me! :-).

 --
 François Pinard   http://pinard.progiciels-bpi.ca

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-08 Thread François Pinard
[Brian Ripley]
[François Pinard]
[Brian Ripley]

One problem [...] is that R's I/O is not line-oriented but
stream-oriented.  So selecting lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random
selection of lines.

That was not my point. [...] Skipping lines you do not need will take 
longer than you might guess (based on some limited experience).

Thanks for telling me (and also for the expression "reservoir sampling").
OK, then.  All in all, if I ever need this for bigger datasets, 
selection might better be done outside of R.

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-08 Thread François Pinard
[Martin Maechler]

FrPi Suppose the file (or tape) holds N records (N is not known
FrPi in advance), from which we want a sample of M records at
FrPi most. [...] If the algorithm is carefully designed, when
FrPi the last (N'th) record of the file will have been processed
FrPi this way, we may then have M records randomly selected from
FrPi N records, in such a way that each of the N records had an
FrPi equal probability to end up in the selection of M records.  I
FrPi may seek out for details if needed.

[...] I'm also intrigued about the details of the algorithm you
outline above.

I went into my old SPSS books and related references to find it for you, 
to no avail (yet I confess I did not try very hard).  I vaguely remember 
it was related to Spearman's correlation computation: I did find notes 
about the severe memory limitation of this computation, but nothing 
about the implemented workaround.  I did find other sampling devices, 
but not the very one I remember having read about, many years ago.

On the other hand, Googling tells that this topic has been much studied, 
and that Vitter's algorithm Z seems to be popular nowadays (even if not 
the simplest) because it is more efficient than others.  Google found 
a copy of the paper:

   http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf

Here is an implementation for Postgres: 

   http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php

yet I do not find it very readable -- but this is only an opinion: I'm 
rather demanding in the area of legibility, while many or most people 
are more courageous than me! :-).

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-08 Thread hadley wickham
 Thanks as well for these hints.  Googling around as you suggested (yet
 keeping my eyes in the MySQL direction, because this is what we use),
 getting MySQL itself to do the selection looks a bit discouraging:
 according to the comments I've read, MySQL does not seem to scale well
 with the database size, especially when records have to be decorated
 with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you
need to think carefully about what operations are most common in your
database.  For example, the problem is much easier if you can assume
that the rows are numbered sequentially from 1 to n.  This could be
enforced using a trigger whenever a record is added/deleted.  This
would slow insertions/deletions but speed selects.

 Just for fun: here, sample(1, 10) in R is slowish already :-).

This is another example where greater knowledge of the problem can yield
speed increases.  Here (where the number of selections is much smaller
than the total number of objects) you are better off generating 10
numbers with runif(10, 0, 100) and then checking that they are
unique.
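One way to read this, as a hedged sketch (the helper name and the sizes are 
only illustrative):

draw_ids <- function(N, m) {
  repeat {
    ids <- ceiling(runif(m, 0, N))     # m candidate row ids in 1..N
    if (!any(duplicated(ids))) return(ids)
  }
}
draw_ids(1e8, 10)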

 Another possibility is to generate a large table of randomly
 distributed ids and then use that (with randomly generated limits) to
 select the appropriate number of records.

 I'm not sure I understand your idea (what confuses me is the randomly
 generated limits part).  If the large table is much larger than the
 size of the wanted sample, we might not be gaining much.

Think about using a table of random numbers.  They are pregenerated
for you, you just choose a starting and ending index.  It will be slow
to generate the table the first time, but then it will be fast.  It
will also take up quite a bit of space, but space is cheap (and time
is not!)
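One possible reading of this idea, as a hedged sketch: keep a pregenerated 
table shuffled(pos, id) mapping a sequential position to a shuffled record 
id, so a sample of size m is just a random contiguous slice joined back to 
the data (the connection, the table and column names, and the sizes are all 
invented for illustration):

library(DBI)
## con is assumed to be an existing DBI connection; sizes are placeholders
n_rows <- 1000000; m <- 1000
start <- sample.int(n_rows - m + 1, 1)
smp <- dbGetQuery(con, sprintf(
  "SELECT t.* FROM shuffled s JOIN big_table t ON t.id = s.id
   WHERE s.pos BETWEEN %d AND %d", start, start + m - 1))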

Hadley

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-08 Thread François Pinard
[hadley wickham]

 [...] according to the comments I've read, MySQL does not seem to scale
 well with the database size, especially when records have to be
 decorated with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you
need to think carefully about what operations are most common in your
database.  For example, the problem is much easier if you can assume
that the rows are numbered sequentially from 1 to n.  This could be
enforced using a trigger whenever a record is added/deleted.  This would
slow insertions/deletions but speed selects.

Sure, to take a caricature example, if database records are already 
decorated with random numbers, and an index is built over the 
decoration, random sampling may indeed be done more quickly :-).  The fact 
is that (at least our) databases are not especially designed for random 
sampling, and the people in charge would resist redesigning them merely 
because there would be a few needs for random sampling.

What would be ideal is being able to build random samples out of any big 
database or file, with equal ease.  The fact is that it's doable.  
(Brian Ripley points out that R textual I/O has too much overhead for 
being usable, so one should rather say, sadly: It's doable outside R.)

 Just for fun: here, sample(1, 10) in R is slowish already
 :-).

This is another example where greater knowledge of the problem can yield
speed increases.  Here (where the number of selections is much smaller
than the total number of objects) you are better off generating 10
numbers with runif(10, 0, 100) and then checking that they are
unique.

Of course, my remark about sample() is related to the previous 
discussion.  If sample(N, M) were more on the O(M) side than on 
the O(N) side (both memory-wise and cpu-wise), it could be used for 
preselecting which rows of a big database to include in a random sample, 
so building on your idea of using a set of IDs.  As the sample of 
M records will have to be processed in-memory by R anyway, computing 
a vector of M indices does not (or should not) increase complexity.

However, sample(N, M) is likely less usable for randomly sampling 
a database if it is O(N) to start with.  As for your suggestion of using 
runif and later checking uniqueness, sample() could well be 
implemented this way when the arguments are appropriate.  The greater 
knowledge of the problem could be built right into the routine meant 
to solve it.  sample(N, M) could even know how to take advantage of 
some simplified case of a reservoir sampling technique :-).
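A rough way to see the point, under the assumption that sample(N, M) pays a 
cost proportional to N (as contemporary R did) while the runif route only 
pays for M; the sizes are illustrative:

system.time(sample(1e7, 10))              # cost grows with N
system.time(ceiling(runif(10, 0, 1e7)))   # cost grows with M only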

 [...] a large table of randomly distributed ids [...] (with randomly
 generated limits) to select the appropriate number of records.

[...] a table of random numbers [...] pregenerated for you, you just
choose a starting and ending index.  It will be slow to generate the
table the first time, but then it will be fast.  It will also take up
quite a bit of space, but space is cheap (and time is not!)

Thanks for the explanation.

In the case under consideration here (random sampling of a big file or 
database), I would be tempted to guess that the time required for 
generating pseudo-random numbers is negligible compared to the 
overall input/output time, so pregenerating randomized IDs might not be 
worth the trouble -- especially since whenever the database size changes, 
the list of pregenerated IDs is no longer valid.

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-06 Thread Prof Brian Ripley
[Just one point extracted: Hadley Wickham has answered the random sample 
one]


On Thu, 5 Jan 2006, François Pinard wrote:


[Brian Ripley]

One problem with Francois Pinard's suggestion (the credit has got lost)
is that R's I/O is not line-oriented but stream-oriented.  So selecting
lines is not particularly easy in R.


I understand that you mean random access to lines, instead of random
selection of lines.  Once again, this chat comes out of reading someone
else's problem; this is not a problem I actually have.  SPSS was not
randomly accessing lines, as data files could well be held on magnetic
tapes, where random access is not possible in usual practice.  SPSS
reads (or was reading) lines sequentially from beginning to end, and the
_random_ sample is built while the reading goes.


That was not my point.  R's standard I/O is through connections, which 
allow for pushbacks, changing line endings and re-encoding character sets. 
That does add overhead compared to C/Fortran line-buffered reading of a 
file.  Skipping lines you do not need will take longer than you might 
guess (based on some limited experience).


--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-06 Thread Martin Maechler
 FrPi == François Pinard [EMAIL PROTECTED]
 on Thu, 5 Jan 2006 22:41:21 -0500 writes:

FrPi [Brian Ripley]
 I rather thought that using a DBMS was standard practice in the 
 R community for those using large datasets: it gets discussed rather 
 often.

FrPi Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

 Another possibility is to make use of the several DBMS interfaces already
 available for R.  It is very easy to pull in a sample from one of those,
 and surely keeping such large data files as ASCII is not good practice.

FrPi Selecting a sample is easy.  Yet, I'm not aware of any
FrPi SQL device for easily selecting a _random_ sample of
FrPi the records of a given table.  On the other hand, I'm
FrPi no SQL specialist, others might know better.

FrPi We do not have a need yet for samples where I work,
FrPi but if we ever need such, they will have to be random,
FrPi or else, I will always fear biases.

 One problem with Francois Pinard's suggestion (the credit has got lost) 
 is that R's I/O is not line-oriented but stream-oriented.  So selecting 
 lines is not particularly easy in R.

FrPi I understand that you mean random access to lines,
FrPi instead of random selection of lines.  Once again,
FrPi this chat comes out of reading someone else's problem,
FrPi this is not a problem I actually have.  SPSS was not
FrPi randomly accessing lines, as data files could well be
FrPi hold on magnetic tapes, where random access is not
FrPi possible on average practice.  SPSS reads (or was
FrPi reading) lines sequentially from beginning to end, and
FrPi the _random_ sample is built while the reading goes.

FrPi Suppose the file (or tape) holds N records (N is not
FrPi known in advance), from which we want a sample of M
 FrPi records at most.  If N <= M, then we use the whole
FrPi file, no sampling is possible nor necessary.
FrPi Otherwise, we first initialise M records with the
FrPi first M records of the file.  Then, for each record in
FrPi the file after the M'th, the algorithm has to decide
FrPi if the record just read will be discarded or if it
FrPi will replace one of the M records already saved, and
FrPi in the latter case, which of those records will be
FrPi replaced.  If the algorithm is carefully designed,
FrPi when the last (N'th) record of the file will have been
FrPi processed this way, we may then have M records
 FrPi randomly selected from N records, in such a way that
FrPi each of the N records had an equal probability to end
FrPi up in the selection of M records.  I may seek out for
FrPi details if needed.

 FrPi This is my suggestion, or in fact, more a thought than
FrPi a suggestion.  It might represent something useful
FrPi either for flat ASCII files or even for a stream of
FrPi records coming out of a database, if those effectively
FrPi do not offer ready random sampling devices.


FrPi P.S. - In the (rather unlikely, I admit) case the gang
FrPi I'm part of would have the need described above, and
FrPi if I then dared implementing it myself, would it be welcome?

I think this would be a very interesting tool and
I'm also intrigued about the details of the algorithm you
outline above.

If it could be made to work on all kinds of read.table()-readable
files (i.e., of course including *.csv), that might be a valuable
tool for all those -- and there are many -- for whom working
with DBMSs is too daunting initially.

Martin Maechler, ETH Zurich

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-06 Thread Prof Brian Ripley

On Fri, 6 Jan 2006, Martin Maechler wrote:


FrPi == François Pinard [EMAIL PROTECTED]
on Thu, 5 Jan 2006 22:41:21 -0500 writes:


   FrPi [Brian Ripley]
I rather thought that using a DBMS was standard practice in the
R community for those using large datasets: it gets discussed rather
often.

   FrPi Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

Another possibility is to make use of the several DBMS interfaces already
available for R.  It is very easy to pull in a sample from one of those,
and surely keeping such large data files as ASCII is not good practice.

   FrPi Selecting a sample is easy.  Yet, I'm not aware of any
   FrPi SQL device for easily selecting a _random_ sample of
   FrPi the records of a given table.  On the other hand, I'm
   FrPi no SQL specialist, others might know better.

   FrPi We do not have a need yet for samples where I work,
   FrPi but if we ever need such, they will have to be random,
   FrPi or else, I will always fear biases.

One problem with Francois Pinard's suggestion (the credit has got lost)
is that R's I/O is not line-oriented but stream-oriented.  So selecting
lines is not particularly easy in R.

   FrPi I understand that you mean random access to lines,
   FrPi instead of random selection of lines.  Once again,
   FrPi this chat comes out of reading someone else's problem,
   FrPi this is not a problem I actually have.  SPSS was not
   FrPi randomly accessing lines, as data files could well be
   FrPi held on magnetic tapes, where random access is not
   FrPi possible in usual practice.  SPSS reads (or was
   FrPi reading) lines sequentially from beginning to end, and
   FrPi the _random_ sample is built while the reading goes.

   FrPi Suppose the file (or tape) holds N records (N is not
   FrPi known in advance), from which we want a sample of M
   FrPi records at most.  If N <= M, then we use the whole
   FrPi file, no sampling is possible nor necessary.
   FrPi Otherwise, we first initialise M records with the
   FrPi first M records of the file.  Then, for each record in
   FrPi the file after the M'th, the algorithm has to decide
   FrPi if the record just read will be discarded or if it
   FrPi will replace one of the M records already saved, and
   FrPi in the latter case, which of those records will be
   FrPi replaced.  If the algorithm is carefully designed,
   FrPi when the last (N'th) record of the file will have been
   FrPi processed this way, we may then have M records
   FrPi randomly selected from N records, in such a way that
   FrPi each of the N records had an equal probability to end
   FrPi up in the selection of M records.  I may seek out for
   FrPi details if needed.

   FrPi This is my suggestion, or in fact, more a thought than
   FrPi a suggestion.  It might represent something useful
   FrPi either for flat ASCII files or even for a stream of
   FrPi records coming out of a database, if those effectively
   FrPi do not offer ready random sampling devices.


   FrPi P.S. - In the (rather unlikely, I admit) case the gang
   FrPi I'm part of would have the need described above, and
   FrPi if I then dared implementing it myself, would it be welcome?

I think this would be a very interesting tool and
I'm also intrigued about the details of the algorithm you
outline above.


It's called `reservoir sampling' and is described in my simulation book 
and Knuth and elsewhere.



If it could be made to work on all kinds of read.table()-readable
files (i.e., of course including *.csv), that might be a valuable
tool for all those -- and there are many -- for whom working
with DBMSs is too daunting initially.


It would be better (for the reasons I gave) to do this in a separate file 
preprocessor: read.table reads from a connection not a file, of course.


--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-06 Thread Wensui Liu
RG,

Actually, SQLite provides a way to read a *.csv file directly into the db.

Just for your consideration.

On 1/5/06, ronggui [EMAIL PROTECTED] wrote:

 2006/1/6, jim holtman [EMAIL PROTECTED]:
  If what you are reading in is numeric data, then it would require (807 *
  118519 * 8) 760MB just to store a single copy of the object -- more
 memory
  than you have on your computer.  If you were reading it in, then the
 problem
  is the paging that was occurring.
 In fact,If I read it in 3 pieces, each is about 170M.

 
  You have to look at storing this in a database and working on a subset
 of
  the data.  Do you really need to have all 807 variables in memory at the
  same time?

 Yip,I don't need all the variables.But I don't know how to get the
 necessary  variables into R.

 At last I  read the data in piece and use RSQLite package to write it
 to a database.and do then do the analysis. If i am familiar with
 database software, using database (and R) is the best choice,but
 convert the file into database format is not an easy job for me.I ask
 for help in SQLite list,but the solution is not satisfying as that
 required the knowledge about the third script language.After searching
 the internet,I get this solution:

 #begin
 rm(list=ls())
 f <- file("D:\\wvsevs_sb_v4.csv", "r")
 i <- 0
 done <- FALSE
 library(RSQLite)
 con <- dbConnect(SQLite(), "c:\\sqlite\\database.db3")
 tim1 <- Sys.time()

 while (!done) {
   i <- i + 1
   tt <- readLines(f, 2500)
   if (length(tt) < 2500) done <- TRUE
   tt <- textConnection(tt)
   if (i == 1) {
     assign("dat", read.table(tt, head = TRUE, sep = ",", quote = ""))
   }
   else assign("dat", read.table(tt, head = FALSE, sep = ",", quote = ""))
   close(tt)
   ifelse(dbExistsTable(con, "wvs"), dbWriteTable(con, "wvs", dat, append = TRUE),
          dbWriteTable(con, "wvs", dat))
 }
 close(f)
 #end
 It's not the best solution,but it works.



  If you use 'scan', you could specify that you do not want some of the
  variables read in so it might make a more reasonably sized objects.
 
 
  On 1/5/06, François Pinard [EMAIL PROTECTED] wrote:
   [ronggui]
  
   R's week when handling large data file.  I has a data file : 807
 vars,
   118519 obs.and its CVS format.  Stata can read it in in 2 minus,but
 In
   my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.
  
   Just (another) thought.  I used to use SPSS, many, many years ago, on
   CDC machines, where the CPU had limited memory and no kind of paging
   architecture.  Files did not need to be very large for being too
 large.
  
   SPSS had a feature that was then useful, about the capability of
   sampling a big dataset directly at file read time, quite before
   processing starts.  Maybe something similar could help in R (that is,
   instead of reading the whole data in memory, _then_ sampling it.)
  
   One can read records from a file, up to a preset amount of them.  If
 the
   file happens to contain more records than that preset number (the
 number
   of records in the whole file is not known beforehand), already read
   records may be dropped at random and replaced by other records coming
   from the file being read.  If the random selection algorithm is
 properly
   chosen, it can be made so that all records in the original file have
   equal probability of being kept in the final subset.
  
   If such a sampling facility was built right within usual R reading
   routines (triggered by an extra argument, say), it could offer
   a compromise for processing large files, and also sometimes accelerate
   computations for big problems, even when memory is not at stake.
  
   --
   François Pinard   http://pinard.progiciels-bpi.ca
  
   __
   R-help@stat.math.ethz.ch mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide!
  http://www.R-project.org/posting-guide.html
  
 
 
 
  --
  Jim Holtman
  Cincinnati, OH
  +1 513 247 0281
 
  What the problem you are trying to solve?


 --
 黄荣贵
 Deparment of Sociology
 Fudan University

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide!
 http://www.R-project.org/posting-guide.html




--
WenSui Liu
(http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children Hospital Medical Center


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-06 Thread Wensui Liu
RG,

I think the .import command in sqlite should work.  Plus, sqlite browser
(http://sqlitebrowser.sourceforge.net) might do the job as well.

On 1/6/06, ronggui [EMAIL PROTECTED] wrote:

 Can you give me some hints, or let me know how to do it?

 Thank you !

 2006/1/6, Wensui Liu [EMAIL PROTECTED]:
  RG,
 
   Actually, SQLite provides a solution to read *.csv file directly into
 db.
 
   Just for your consideration.
 
 
  On 1/5/06, ronggui [EMAIL PROTECTED] wrote:
   2006/1/6, jim holtman [EMAIL PROTECTED]:
If what you are reading in is numeric data, then it would require
 (807 *
118519 * 8) 760MB just to store a single copy of the object -- more
  memory
than you have on your computer.  If you were reading it in, then the
  problem
is the paging that was occurring.
   In fact,If I read it in 3 pieces, each is about 170M.
  
   
You have to look at storing this in a database and working on a
 subset
  of
the data.  Do you really need to have all 807 variables in memory at
 the
same time?
  
   Yip,I don't need all the variables.But I don't know how to get the
   necessary  variables into R.
  
   At last I  read the data in piece and use RSQLite package to write it
   to a database.and do then do the analysis. If i am familiar with
   database software, using database (and R) is the best choice,but
   convert the file into database format is not an easy job for me.I ask
   for help in SQLite list,but the solution is not satisfying as that
   required the knowledge about the third script language.After searching
   the internet,I get this solution:
  
   #begin
   rm(list=ls())
   f-file(D:\wvsevs_sb_v4.csv,r)
   i - 0
   done - FALSE
   library(RSQLite)
   con-dbConnect(SQLite,c:\sqlite\database.db3)
   tim1-Sys.time()
  
   while(!done){
   i-i+1
   tt-readLines(f,2500)
   if (length(tt)2500) done - TRUE
   tt-textConnection(tt)
   if (i==1) {
  assign(dat,read.table(tt,head=T,sep=,,quote=));
}
   else assign(dat,read.table(tt,head=F,sep=,,quote=))
   close(tt)
   ifelse(dbExistsTable(con,
  wvs),dbWriteTable(con,wvs,dat,append=T),
 dbWriteTable(con,wvs,dat) )
   }
   close(f)
   #end
   It's not the best solution,but it works.
  
  
  
If you use 'scan', you could specify that you do not want some of
 the
variables read in so it might make a more reasonably sized objects.
   
   
On 1/5/06, François Pinard  [EMAIL PROTECTED] wrote:
 [ronggui]

 R's week when handling large data file.  I has a data file : 807
  vars,
 118519 obs.and its CVS format.  Stata can read it in in 2
 minus,but
  In
 my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.

 Just (another) thought.  I used to use SPSS, many, many years ago,
 on
 CDC machines, where the CPU had limited memory and no kind of
 paging
 architecture.  Files did not need to be very large for being too
  large.

 SPSS had a feature that was then useful, about the capability of
 sampling a big dataset directly at file read time, quite before
 processing starts.  Maybe something similar could help in R (that
 is,
 instead of reading the whole data in memory, _then_ sampling it.)

 One can read records from a file, up to a preset amount of
 them.  If
  the
 file happens to contain more records than that preset number (the
  number
 of records in the whole file is not known beforehand), already
 read
 records may be dropped at random and replaced by other records
 coming
 from the file being read.  If the random selection algorithm is
  properly
 chosen, it can be made so that all records in the original file
 have
 equal probability of being kept in the final subset.

 If such a sampling facility was built right within usual R reading
 routines (triggered by an extra argument, say), it could offer
 a compromise for processing large files, and also sometimes
 accelerate
 computations for big problems, even when memory is not at stake.

 --
 François Pinard   http://pinard.progiciels-bpi.ca

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

   
   
   
--
Jim Holtman
Cincinnati, OH
+1 513 247 0281
   
What the problem you are trying to solve?
  
  
   --
   黄荣贵
   Deparment of Sociology
   Fudan University
  
   __
   R-help@stat.math.ethz.ch mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide!
  http://www.R-project.org/posting-guide.html
 
 
 
  --
  WenSui Liu
  (http://statcompute.blogspot.com)
  Senior Decision Support Analyst
  Health Policy and Clinical Effectiveness
  Cincinnati Children Hospital Medical Center
 


 --
 黄荣贵
 Deparment of Sociology
 Fudan 

[R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread François Pinard
[ronggui]

R is weak when handling large data files.  I have a data file: 807 vars,
118519 obs., and it's in CSV format.  Stata can read it in in 2 minutes, but on
my PC R can hardly handle it.  My PC's CPU is 1.7G; RAM 512M.

Just (another) thought.  I used to use SPSS, many, many years ago, on 
CDC machines, where the CPU had limited memory and no kind of paging 
architecture.  Files did not need to be very large to be too large.

SPSS had a feature that was then useful: the capability of 
sampling a big dataset directly at file read time, before 
processing starts.  Maybe something similar could help in R (that is, 
instead of reading the whole data in memory, _then_ sampling it).

One can read records from a file, up to a preset amount of them.  If the 
file happens to contain more records than that preset number (the number 
of records in the whole file is not known beforehand), already read 
records may be dropped at random and replaced by other records coming 
from the file being read.  If the random selection algorithm is properly 
chosen, it can be made so that all records in the original file have 
equal probability of being kept in the final subset.

If such a sampling facility was built right within usual R reading 
routines (triggered by an extra argument, say), it could offer 
a compromise for processing large files, and also sometimes accelerate 
computations for big problems, even when memory is not at stake.
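A sketch of what such a facility might look like when done outside the 
standard readers: stream the file in chunks, keep a reservoir of at most m 
raw lines, then hand the survivors to read.table.  The function name, the 
chunk size, and the csv-style defaults are all illustrative assumptions:

sample_lines <- function(file, m, chunk = 2500, header = TRUE, ...) {
  con <- file(file, "r")
  on.exit(close(con))
  hdr <- if (header) readLines(con, 1) else character(0)
  res <- character(0)                   # the reservoir of kept lines
  seen <- 0                             # data lines read so far
  repeat {
    lines <- readLines(con, chunk)
    if (length(lines) == 0) break
    for (line in lines) {
      seen <- seen + 1
      if (length(res) < m) {
        res <- c(res, line)             # fill the reservoir first
      } else if (runif(1) < m / seen) {
        res[sample.int(m, 1)] <- line   # keep line `seen` with probability m/seen
      }
    }
  }
  tc <- textConnection(c(hdr, res))
  on.exit(close(tc), add = TRUE)
  read.table(tc, header = header, ...)
}
## e.g. (file name is a placeholder):
## smp <- sample_lines("big.csv", 10000, sep = ",", quote = "")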

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread Kort, Eric
 -Original Message-
 
 [ronggui]
 
 R's week when handling large data file.  I has a data file : 807 vars,
 118519 obs.and its CVS format.  Stata can read it in in 2 minus,but In
 my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.
 
 Just (another) thought.  I used to use SPSS, many, many years ago, on
 CDC machines, where the CPU had limited memory and no kind of paging
 architecture.  Files did not need to be very large for being too large.
 
 SPSS had a feature that was then useful, about the capability of
 sampling a big dataset directly at file read time, quite before
 processing starts.  Maybe something similar could help in R (that is,
 instead of reading the whole data in memory, _then_ sampling it.)
 
 One can read records from a file, up to a preset amount of them.  If the
 file happens to contain more records than that preset number (the number
 of records in the whole file is not known beforehand), already read
 records may be dropped at random and replaced by other records coming
 from the file being read.  If the random selection algorithm is properly
 chosen, it can be made so that all records in the original file have
 equal probability of being kept in the final subset.
 
 If such a sampling facility was built right within usual R reading
 routines (triggered by an extra argument, say), it could offer
 a compromise for processing large files, and also sometimes accelerate
 computations for big problems, even when memory is not at stake.
 

Since I often work with images and other large data sets, I have been thinking 
about a BLOb (binary large object--though it wouldn't necessarily have to be 
binary) package for R--one that would handle I/O for such creatures and only 
bring as much data into the R space as was actually needed.

So I see 3 possibilities:

1. The sort of functionality you describe is implemented in the R internals (by 
people other than me).
2. Some individuals (perhaps myself included) write such a package.
3. This thread fizzles out and we do nothing.

I guess I will see what, if any, discussion ensues from this point to see which 
of these three options seems worth pursuing.

 --
 François Pinard   http://pinard.progiciels-bpi.ca
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide! http://www.R-project.org/posting-
 guide.html

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread Prof Brian Ripley
Another possibility is to make use of the several DBMS interfaces already 
available for R.  It is very easy to pull in a sample from one of those, 
and surely keeping such large data files as ASCII is not good practice.


One problem with Francois Pinard's suggestion (the credit has got lost) is 
that R's I/O is not line-oriented but stream-oriented.  So selecting lines 
is not particularly easy in R.  That's a deliberate design decision, given 
the DBMS interfaces.


I rather thought that using a DBMS was standard practice in the R 
community for those using large datasets: it gets discussed rather often.


On Thu, 5 Jan 2006, Kort, Eric wrote:


-Original Message-

[ronggui]


R's week when handling large data file.  I has a data file : 807 vars,
118519 obs.and its CVS format.  Stata can read it in in 2 minus,but In
my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.


Just (another) thought.  I used to use SPSS, many, many years ago, on
CDC machines, where the CPU had limited memory and no kind of paging
architecture.  Files did not need to be very large for being too large.

SPSS had a feature that was then useful, about the capability of
sampling a big dataset directly at file read time, quite before
processing starts.  Maybe something similar could help in R (that is,
instead of reading the whole data in memory, _then_ sampling it.)

One can read records from a file, up to a preset amount of them.  If the
file happens to contain more records than that preset number (the number
of records in the whole file is not known beforehand), already read
records may be dropped at random and replaced by other records coming
from the file being read.  If the random selection algorithm is properly
chosen, it can be made so that all records in the original file have
equal probability of being kept in the final subset.

If such a sampling facility was built right within usual R reading
routines (triggered by an extra argument, say), it could offer
a compromise for processing large files, and also sometimes accelerate
computations for big problems, even when memory is not at stake.



Since I often work with images and other large data sets, I have been thinking about a 
BLOb (binary large object--though it wouldn't necessarily have to be binary) 
package for R--one that would handle I/O for such creatures and only bring as much data 
into the R space as was actually needed.

So I see 3 possibilities:

1. The sort of functionality you describe is implemented in the R internals (by 
people other than me).
2. Some individuals (perhaps myself included) write such a package.
3. This thread fizzles out and we do nothing.

I guess I will see what, if any, discussion ensues from this point to see which 
of these three options seems worth pursuing.


--
François Pinard   http://pinard.progiciels-bpi.ca


--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread jim holtman
If what you are reading in is numeric data, then it would require 807 *
118519 * 8 bytes, roughly 760MB, just to store a single copy of the object --
more memory than you have on your computer.  If you were reading it in, then
the problem is the paging that was occurring.

You have to look at storing this in a database and working on a subset of
the data.  Do you really need to have all 807 variables in memory at the
same time?

If you use 'scan', you could specify that you do not want some of the
variables read in, so it might make a more reasonably sized object.
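For instance, a hedged sketch of skipping columns at read time with 
read.table's colClasses (the file name and the columns kept are made up; 
scan() with a what= list containing NULL components achieves the same):

keep <- c(1, 5, 12)                      # the columns actually wanted
cc <- rep("NULL", 807)                   # "NULL" makes read.table drop a column
cc[keep] <- NA                           # NA lets read.table guess the type
dat <- read.table("wvsevs_sb_v4.csv", header = TRUE, sep = ",",
                  colClasses = cc)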


On 1/5/06, François Pinard [EMAIL PROTECTED] wrote:

 [ronggui]

 R's week when handling large data file.  I has a data file : 807 vars,
 118519 obs.and its CVS format.  Stata can read it in in 2 minus,but In
 my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.

 Just (another) thought.  I used to use SPSS, many, many years ago, on
 CDC machines, where the CPU had limited memory and no kind of paging
 architecture.  Files did not need to be very large for being too large.

 SPSS had a feature that was then useful, about the capability of
 sampling a big dataset directly at file read time, quite before
 processing starts.  Maybe something similar could help in R (that is,
 instead of reading the whole data in memory, _then_ sampling it.)

 One can read records from a file, up to a preset amount of them.  If the
 file happens to contain more records than that preset number (the number
 of records in the whole file is not known beforehand), already read
 records may be dropped at random and replaced by other records coming
 from the file being read.  If the random selection algorithm is properly
 chosen, it can be made so that all records in the original file have
 equal probability of being kept in the final subset.

 If such a sampling facility was built right within usual R reading
 routines (triggered by an extra argument, say), it could offer
 a compromise for processing large files, and also sometimes accelerate
 computations for big problems, even when memory is not at stake.

 --
 François Pinard   http://pinard.progiciels-bpi.ca

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide!
 http://www.R-project.org/posting-guide.html




--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What is the problem you are trying to solve?


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread ronggui
2006/1/6, jim holtman [EMAIL PROTECTED]:
 If what you are reading in is numeric data, then it would require (807 *
 118519 * 8) 760MB just to store a single copy of the object -- more memory
 than you have on your computer.  If you were reading it in, then the problem
 is the paging that was occurring.
In fact, if I read it in 3 pieces, each is about 170M.


 You have to look at storing this in a database and working on a subset of
 the data.  Do you really need to have all 807 variables in memory at the
 same time?

Yip, I don't need all the variables.  But I don't know how to get the
necessary variables into R.

In the end I read the data in pieces and used the RSQLite package to write
them to a database, and then do the analysis.  If I were familiar with
database software, using a database (and R) would be the best choice, but
converting the file into database format is not an easy job for me.  I asked
for help on the SQLite list, but the solution was not satisfying, as it
required knowledge of a third scripting language.  After searching
the internet, I got this solution:

#begin
rm(list=ls())
f <- file("D:\\wvsevs_sb_v4.csv", "r")
i <- 0
done <- FALSE
library(RSQLite)
con <- dbConnect(SQLite(), "c:\\sqlite\\database.db3")
tim1 <- Sys.time()

while (!done) {
  i <- i + 1
  tt <- readLines(f, 2500)
  if (length(tt) < 2500) done <- TRUE
  tt <- textConnection(tt)
  if (i == 1) {
    assign("dat", read.table(tt, head = TRUE, sep = ",", quote = ""))
  }
  else assign("dat", read.table(tt, head = FALSE, sep = ",", quote = ""))
  close(tt)
  ifelse(dbExistsTable(con, "wvs"), dbWriteTable(con, "wvs", dat, append = TRUE),
         dbWriteTable(con, "wvs", dat))
}
close(f)
#end
It's not the best solution,but it works.



 If you use 'scan', you could specify that you do not want some of the
 variables read in so it might make a more reasonably sized objects.


 On 1/5/06, François Pinard [EMAIL PROTECTED] wrote:
  [ronggui]
 
  R's week when handling large data file.  I has a data file : 807 vars,
  118519 obs.and its CVS format.  Stata can read it in in 2 minus,but In
  my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.
 
  Just (another) thought.  I used to use SPSS, many, many years ago, on
  CDC machines, where the CPU had limited memory and no kind of paging
  architecture.  Files did not need to be very large for being too large.
 
  SPSS had a feature that was then useful, about the capability of
  sampling a big dataset directly at file read time, quite before
  processing starts.  Maybe something similar could help in R (that is,
  instead of reading the whole data in memory, _then_ sampling it.)
 
  One can read records from a file, up to a preset amount of them.  If the
  file happens to contain more records than that preset number (the number
  of records in the whole file is not known beforehand), already read
  records may be dropped at random and replaced by other records coming
  from the file being read.  If the random selection algorithm is properly
  chosen, it can be made so that all records in the original file have
  equal probability of being kept in the final subset.
 
  If such a sampling facility was built right within usual R reading
  routines (triggered by an extra argument, say), it could offer
  a compromise for processing large files, and also sometimes accelerate
  computations for big problems, even when memory is not at stake.
 
  --
  François Pinard   http://pinard.progiciels-bpi.ca
 
  __
  R-help@stat.math.ethz.ch mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide!
 http://www.R-project.org/posting-guide.html
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 247 0281

 What the problem you are trying to solve?


--
黄荣贵
Deparment of Sociology
Fudan University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread bogdan romocea
ronggui wrote:
 If i am familiar with
 database software, using database (and R) is the best choice,but
 convert the file into database format is not an easy job for me.

Good working knowledge of a DBMS is almost invaluable when it comes to
working with very large data sets. In addition, learning SQL is a piece
of cake compared to learning R. On top of that, knowledge of another
(SQL) scripting language is not needed (except perhaps for special
tasks): you can easily use R to generate the SQL syntax to import and
work with arbitrarily wide tables. (I'm not familiar with SQLite, but
MySQL comes with a command line tool that can run syntax files.)
Better start learning SQL today.
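A hedged sketch of that last point, generating import SQL from a wide csv's 
header row (the file name, the table name, and the all-DOUBLE column type are 
assumptions for illustration):

hdr <- read.table("wvsevs_sb_v4.csv", sep = ",", nrows = 1, as.is = TRUE)
cols <- paste(sprintf("%s DOUBLE", make.names(unlist(hdr))), collapse = ",\n  ")
cat(sprintf("CREATE TABLE wvs (\n  %s\n);\n", cols))
cat("LOAD DATA INFILE 'wvsevs_sb_v4.csv' INTO TABLE wvs\n",
    "FIELDS TERMINATED BY ',' IGNORE 1 LINES;\n")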


 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of ronggui
 Sent: Thursday, January 05, 2006 12:48 PM
 To: jim holtman
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] Suggestion for big files [was: Re: A comment
 about R:]


 2006/1/6, jim holtman [EMAIL PROTECTED]:
  If what you are reading in is numeric data, then it would
 require (807 *
  118519 * 8) 760MB just to store a single copy of the object
 -- more memory
  than you have on your computer.  If you were reading it in,
 then the problem
  is the paging that was occurring.
 In fact,If I read it in 3 pieces, each is about 170M.

 
  You have to look at storing this in a database and working
 on a subset of
  the data.  Do you really need to have all 807 variables in
 memory at the
  same time?

 Yip,I don't need all the variables.But I don't know how to get the
 necessary  variables into R.

 At last I  read the data in piece and use RSQLite package to write it
 to a database.and do then do the analysis. If i am familiar with
 database software, using database (and R) is the best choice,but
 convert the file into database format is not an easy job for me.I ask
 for help in SQLite list,but the solution is not satisfying as that
 required the knowledge about the third script language.After searching
 the internet,I get this solution:

 #begin
 rm(list=ls())
 f-file(D:\wvsevs_sb_v4.csv,r)
 i - 0
 done - FALSE
 library(RSQLite)
 con-dbConnect(SQLite,c:\sqlite\database.db3)
 tim1-Sys.time()

 while(!done){
 i-i+1
 tt-readLines(f,2500)
 if (length(tt)2500) done - TRUE
 tt-textConnection(tt)
 if (i==1) {
assign(dat,read.table(tt,head=T,sep=,,quote=));
  }
 else assign(dat,read.table(tt,head=F,sep=,,quote=))
 close(tt)
 ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T),
   dbWriteTable(con,wvs,dat) )
 }
 close(f)
 #end
 It's not the best solution,but it works.



  If you use 'scan', you could specify that you do not want
 some of the
  variables read in so it might make a more reasonably sized objects.
 
 
  On 1/5/06, François Pinard [EMAIL PROTECTED] wrote:
   [ronggui]
  
   R's week when handling large data file.  I has a data
 file : 807 vars,
   118519 obs.and its CVS format.  Stata can read it in in
 2 minus,but In
   my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.
  
   Just (another) thought.  I used to use SPSS, many, many
 years ago, on
   CDC machines, where the CPU had limited memory and no
 kind of paging
   architecture.  Files did not need to be very large for
 being too large.
  
   SPSS had a feature that was then useful, about the capability of
   sampling a big dataset directly at file read time, quite before
   processing starts.  Maybe something similar could help in
 R (that is,
   instead of reading the whole data in memory, _then_ sampling it.)
  
   One can read records from a file, up to a preset amount
 of them.  If the
   file happens to contain more records than that preset
 number (the number
   of records in the whole file is not known beforehand),
 already read
   records may be dropped at random and replaced by other
 records coming
   from the file being read.  If the random selection
 algorithm is properly
   chosen, it can be made so that all records in the
 original file have
   equal probability of being kept in the final subset.
  
   If such a sampling facility was built right within usual R reading
   routines (triggered by an extra argument, say), it could offer
   a compromise for processing large files, and also
 sometimes accelerate
   computations for big problems, even when memory is not at stake.
  
   --
   François Pinard   http://pinard.progiciels-bpi.ca
  
   __
   R-help@stat.math.ethz.ch mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide!
  http://www.R-project.org/posting-guide.html
  
 
 
 
  --
  Jim Holtman
  Cincinnati, OH
  +1 513 247 0281
 
  What the problem you are trying to solve?


 --
 黄荣贵
 Deparment of Sociology
 Fudan University

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide!
 http://www.R-project.org/posting

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread Neuro LeSuperHéros
Ronggui,

I'm not familiar with SQLite, but using MySQL would solve your problem.

MySQL has a LOAD DATA INFILE statement that loads text/csv files rapidly.

In R, assuming a test table exists in MySQL (blank table is fine), something 
like this would load the data directly in MySQL.

library(DBI)
library(RMySQL)
dbSendQuery(mycon, "LOAD DATA INFILE 'C:/textfile.csv'
INTO TABLE test3 FIELDS TERMINATED BY ','") #for csv files

Then a normal SQL query would allow you to work with a manageable size of 
data.
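For instance (hedged: mycon is the connection assumed above, and the sample 
size is arbitrary), a random working subset can then be pulled with:

smp <- dbGetQuery(mycon, "SELECT * FROM test3 ORDER BY RAND() LIMIT 10000")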



From: bogdan romocea [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: r-help R-help@stat.math.ethz.ch
Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:]
Date: Thu, 5 Jan 2006 15:26:51 -0500

ronggui wrote:
  If i am familiar with
  database software, using database (and R) is the best choice,but
  convert the file into database format is not an easy job for me.

Good working knowledge of a DBMS is almost invaluable when it comes to
working with very large data sets. In addition, learning SQL is piece
of cake compared to learning R. On top of that, knowledge of another
(SQL) scripting language is not needed (except perhaps for special
tasks): you can easily use R to generate the SQL syntax to import and
work with arbitrarily wide tables. (I'm not familiar with SQLite, but
MySQL comes with a command line tool that can run syntax files.)
Better start learning SQL today.


  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of ronggui
  Sent: Thursday, January 05, 2006 12:48 PM
  To: jim holtman
  Cc: r-help@stat.math.ethz.ch
  Subject: Re: [R] Suggestion for big files [was: Re: A comment
  about R:]
 
 
  2006/1/6, jim holtman [EMAIL PROTECTED]:
   If what you are reading in is numeric data, then it would
  require (807 *
   118519 * 8) 760MB just to store a single copy of the object
  -- more memory
   than you have on your computer.  If you were reading it in,
  then the problem
   is the paging that was occurring.
  In fact,If I read it in 3 pieces, each is about 170M.
 
  
   You have to look at storing this in a database and working
  on a subset of
   the data.  Do you really need to have all 807 variables in
  memory at the
   same time?
 
  Yip,I don't need all the variables.But I don't know how to get the
  necessary  variables into R.
 
  At last I  read the data in piece and use RSQLite package to write it
  to a database.and do then do the analysis. If i am familiar with
  database software, using database (and R) is the best choice,but
  convert the file into database format is not an easy job for me.I ask
  for help in SQLite list,but the solution is not satisfying as that
  required the knowledge about the third script language.After searching
  the internet,I get this solution:
 
  #begin
  rm(list=ls())
  f-file(D:\wvsevs_sb_v4.csv,r)
  i - 0
  done - FALSE
  library(RSQLite)
  con-dbConnect(SQLite,c:\sqlite\database.db3)
  tim1-Sys.time()
 
  while(!done){
  i-i+1
  tt-readLines(f,2500)
  if (length(tt)2500) done - TRUE
  tt-textConnection(tt)
  if (i==1) {
 assign(dat,read.table(tt,head=T,sep=,,quote=));
   }
  else assign(dat,read.table(tt,head=F,sep=,,quote=))
  close(tt)
  ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T),
dbWriteTable(con,wvs,dat) )
  }
  close(f)
  #end
  It's not the best solution,but it works.
 
 
 
   If you use 'scan', you could specify that you do not want
  some of the
   variables read in so it might make a more reasonably sized objects.
  
  
   On 1/5/06, François Pinard [EMAIL PROTECTED] wrote:
[ronggui]
   
R's week when handling large data file.  I has a data
  file : 807 vars,
118519 obs.and its CVS format.  Stata can read it in in
  2 minus,but In
my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M.
   
Just (another) thought.  I used to use SPSS, many, many
  years ago, on
CDC machines, where the CPU had limited memory and no
  kind of paging
architecture.  Files did not need to be very large for
  being too large.
   
SPSS had a feature that was then useful, about the capability of
sampling a big dataset directly at file read time, quite before
processing starts.  Maybe something similar could help in
  R (that is,
instead of reading the whole data in memory, _then_ sampling it.)
   
One can read records from a file, up to a preset amount
  of them.  If the
file happens to contain more records than that preset
  number (the number
of records in the whole file is not known beforehand),
  already read
records may be dropped at random and replaced by other
  records coming
from the file being read.  If the random selection
  algorithm is properly
chosen, it can be made so that all records in the
  original file have
equal probability of being kept in the final subset.
   
If such a sampling facility was built right within usual R reading
routines (triggered by an extra

Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread François Pinard
[Brian Ripley]

I rather thought that using a DBMS was standard practice in the 
R community for those using large datasets: it gets discussed rather 
often.

Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

Another possibility is to make use of the several DBMS interfaces already 
available for R.  It is very easy to pull in a sample from one of those, 
and surely keeping such large data files as ASCII is not good practice.

Selecting a sample is easy.  Yet, I'm not aware of any SQL device for 
easily selecting a _random_ sample of the records of a given table.  On 
the other hand, I'm no SQL specialist, others might know better.

We do not have a need yet for samples where I work, but if we ever need 
such, they will have to be random, or else, I will always fear biases.

One problem with Francois Pinard's suggestion (the credit has got lost) 
is that R's I/O is not line-oriented but stream-oriented.  So selecting 
lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random 
selection of lines.  Once again, this chat comes out of reading someone 
else's problem; this is not a problem I actually have.  SPSS was not 
randomly accessing lines, as data files could well be held on magnetic 
tapes, where random access is not possible in usual practice.  SPSS 
reads (or was reading) lines sequentially from beginning to end, and the 
_random_ sample is built while the reading goes.

Suppose the file (or tape) holds N records (N is not known in advance), 
from which we want a sample of M records at most.  If N <= M, then we 
use the whole file: no sampling is possible nor necessary.  Otherwise, 
we first initialise M records with the first M records of the file.  
Then, for each record in the file after the M'th, the algorithm has to 
decide if the record just read will be discarded or if it will replace 
one of the M records already saved, and in the latter case, which of 
those records will be replaced.  If the algorithm is carefully designed, 
when the last (N'th) record of the file has been processed this 
way, we then have M records randomly selected from N records, in 
such a way that each of the N records had an equal probability to end 
up in the selection of M records.  I may seek out the details if needed.
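A minimal sketch of that decision rule, over an in-memory vector for clarity 
(the function name is illustrative): record i, for i > M, replaces a 
uniformly chosen saved record with probability M/i.

reservoir <- function(x, M) {
  N <- length(x)
  if (N <= M) return(x)
  res <- x[seq_len(M)]                  # initialise with the first M records
  for (i in (M + 1):N) {
    if (runif(1) < M / i)               # keep record i with probability M/i
      res[sample.int(M, 1)] <- x[i]     # ...replacing a random saved record
  }
  res
}
reservoir(1:100000, 10)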

This is my suggestion, or in fact, more a thought than a suggestion.  It 
might represent something useful either for flat ASCII files or even for 
a stream of records coming out of a database, if those effectively do 
not offer ready random sampling devices.


P.S. - In the (rather unlikely, I admit) case the gang I'm part of would 
have the need described above, and if I then dared implementing it 
myself, would it be welcome?

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Suggestion for big files [was: Re: A comment about R:]

2006-01-05 Thread hadley wickham
 Selecting a sample is easy.  Yet, I'm not aware of any SQL device for
 easily selecting a _random_ sample of the records of a given table.  On
 the other hand, I'm no SQL specialist, others might know better.

There are a number of such devices, which tend to be rather SQL
variant specific.  Try googling for "select random rows mysql", "select
random rows pgsql", etc.

Another possibility is to generate a large table of randomly
distributed ids and then use that (with randomly generated limits) to
select the appropriate number of records.

Hadley

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html