Re: [R] Suggestion for big files [was: Re: A comment about R:]
[hadley wickham] [François Pinard] Selecting a sample is easy. Yet, I'm not aware of any SQL device for easily selecting a _random_ sample of the records of a given table. On the other hand, I'm no SQL specialist, others might know better.

There are a number of such devices, which tend to be rather SQL variant specific. Try googling for select random rows mysql, select random rows pgsql, etc.

Thanks as well for these hints. Googling around as you suggested (yet keeping my eyes in the MySQL direction, because this is what we use), getting MySQL itself to do the selection looks a bit discouraging: according to comments I've read, MySQL does not seem to scale well with the database size, especially when records have to be decorated with random numbers and later sorted. Yet, I did not run any benchmark myself, and would not blindly take everything I read for granted, given that MySQL developers have speed in mind, and there are ways to interrupt a sort before running it to full completion, when only a few sorted records are wanted.

Another possibility is to generate a large table of randomly distributed ids and then use that (with randomly generated limits) to select the appropriate number of records.

I'm not sure I understand your idea (what confuses me is the randomly generated limits part). If the large table is much larger than the size of the wanted sample, we might not be gaining much. Just for fun: here, sample(1, 10) in R is slowish already :-). All in all, if I ever have such a problem, a practical solution probably has to be outside of R, and maybe outside SQL as well. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
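For concreteness, the kind of SQL device Hadley alludes to usually amounts to ordering by a random key and keeping the first few rows; a minimal sketch from R, assuming an RMySQL connection can be opened and using a made-up table name (this is exactly the decorate-and-sort behaviour whose scaling is questioned above):

library(DBI)
library(RMySQL)
# Connection details and the table name are placeholders, not from the thread.
con <- dbConnect(MySQL(), dbname = "test")
# MySQL syntax; PostgreSQL would use ORDER BY random() instead of RAND().
smp <- dbGetQuery(con, "SELECT * FROM big_table ORDER BY RAND() LIMIT 1000")
dbDisconnect(con)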
Re: [R] Suggestion for big files [was: Re: A comment about R:]
I found Reservoir-Sampling algorithms of time complexity O(n(1+log(N/n))) by Kim-Hung Li, ACM Transactions on Mathematical Software, Vol. 20, No. 4, Dec. 1994, pp. 481-492. He mentions algorithms Z and K and proposes two improved versions, algorithms L and M. Algorithm L is really easy to implement but relatively slow; M doesn't look very difficult and is the fastest. Heberto Ghezzo, McGill University, Montreal - Canada

Quoting François Pinard [EMAIL PROTECTED]:

[Martin Maechler] FrPi Suppose the file (or tape) holds N records (N is not known FrPi in advance), from which we want a sample of M records at FrPi most. [...] If the algorithm is carefully designed, when FrPi the last (N'th) record of the file will have been processed FrPi this way, we may then have M records randomly selected from FrPi N records, in such a way that each of the N records had an FrPi equal probability to end up in the selection of M records. I FrPi may seek out the details if needed. [...] I'm also intrigued about the details of the algorithm you outline above.

I went into my old SPSS books and related references to find it for you, to no avail (yet I confess I did not try very hard). I vaguely remember it was related to Spearman's correlation computation: I did find notes about the severe memory limitation of this computation, but nothing about the implemented workaround. I did find other sampling devices, but not the very one I remember having read about, many years ago. On the other hand, Googling tells me that this topic has been much studied, and that Vitter's algorithm Z seems to be popular nowadays (even if not the simplest) because it is more efficient than others. Google found a copy of the paper: http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf Here is an implementation for Postgres: http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php yet I do not find it very readable -- but this is only an opinion: I'm rather demanding in the area of legibility, while many or most people are more courageous than me! :-). -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
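For the curious, here is a rough R sketch of what Algorithm L does when sampling n lines from a sequentially read text file; it follows the commonly cited formulation of the algorithm rather than the paper's exact pseudocode, and the file and function names are invented:

# Reservoir of size n; between replacements a geometrically distributed block
# of records is read and discarded, so only about O(n(1+log(N/n))) random
# numbers are drawn rather than one per record.
reservoir_L <- function(filename, n) {
  con <- file(filename, "r")
  on.exit(close(con))
  reservoir <- readLines(con, n)               # start with the first n records
  if (length(reservoir) < n) return(reservoir) # fewer than n records in all
  w <- exp(log(runif(1)) / n)
  repeat {
    skip <- floor(log(runif(1)) / log(1 - w))
    chunk <- readLines(con, skip + 1)          # pass over 'skip' records ...
    if (length(chunk) < skip + 1) break        # ... unless the file ends first
    reservoir[sample(n, 1)] <- chunk[skip + 1] # ... and keep the next one
    w <- w * exp(log(runif(1)) / n)
  }
  reservoir
}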
Re: [R] Suggestion for big files [was: Re: A comment about R:]
[Brian Ripley] [François Pinard] [Brian Ripley] One problem [...] is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random selection of lines.

That was not my point. [...] Skipping lines you do not need will take longer than you might guess (based on some limited experience).

Thanks for telling me (and also for the expression reservoir sampling). OK, then. To sum up, if I ever need this for bigger datasets, selection might better be done outside of R. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
[Martin Maechler] FrPi Suppose the file (or tape) holds N records (N is not known FrPi in advance), from which we want a sample of M records at FrPi most. [...] If the algorithm is carefully designed, when FrPi the last (N'th) record of the file will have been processed FrPi this way, we may then have M records randomly selected from FrPi N records, in such a way that each of the N records had an FrPi equal probability to end up in the selection of M records. I FrPi may seek out the details if needed. [...] I'm also intrigued about the details of the algorithm you outline above.

I went into my old SPSS books and related references to find it for you, to no avail (yet I confess I did not try very hard). I vaguely remember it was related to Spearman's correlation computation: I did find notes about the severe memory limitation of this computation, but nothing about the implemented workaround. I did find other sampling devices, but not the very one I remember having read about, many years ago. On the other hand, Googling tells me that this topic has been much studied, and that Vitter's algorithm Z seems to be popular nowadays (even if not the simplest) because it is more efficient than others. Google found a copy of the paper: http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf Here is an implementation for Postgres: http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php yet I do not find it very readable -- but this is only an opinion: I'm rather demanding in the area of legibility, while many or most people are more courageous than me! :-). -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
Thanks as well for these hints. Googling around as you suggested (yet keeping my eyes in the MySQL direction, because this is what we use), getting MySQL itself to do the selection looks a bit discouraging: according to comments I've read, MySQL does not seem to scale well with the database size, especially when records have to be decorated with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you need to think carefully about what operations are most common in your database. For example, the problem is much easier if you can assume that the rows are numbered sequentially from 1 to n. This could be enforced using a trigger whenever a record is added/deleted. This would slow insertions/deletions but speed selects.

Just for fun: here, sample(1, 10) in R is slowish already :-).

This is another example where greater knowledge of the problem can yield speed increases. Here (where the number of selections is much smaller than the total number of objects) you are better off generating 10 numbers with runif(10, 0, 100) and then checking that they are unique.

Another possibility is to generate a large table of randomly distributed ids and then use that (with randomly generated limits) to select the appropriate number of records.

I'm not sure I understand your idea (what confuses me is the randomly generated limits part). If the large table is much larger than the size of the wanted sample, we might not be gaining much.

Think about using a table of random numbers. They are pregenerated for you, you just choose a starting and ending index. It will be slow to generate the table the first time, but then it will be fast. It will also take up quite a bit of space, but space is cheap (and time is not!) Hadley __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
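Spelling the runif() idea out (N is a placeholder for the total row count, and the column name in the final comment is invented):

# Draw 10 distinct row ids uniformly from 1..N without materialising 1:N.
N <- 1e6                                          # hypothetical number of rows
n_wanted <- 10
ids <- integer(0)
while (length(ids) < n_wanted) {                  # guard against rare duplicates
  ids <- unique(c(ids, ceiling(runif(n_wanted, 0, N))))
}
ids <- ids[seq_len(n_wanted)]
# ids can then drive a query such as "... WHERE rownum IN (...)", assuming the
# sequential row numbering suggested above.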
Re: [R] Suggestion for big files [was: Re: A comment about R:]
[hadley wickham] [...] according to comments I've read, MySQL does not seem to scale well with the database size, especially when records have to be decorated with random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you need to think carefully about what operations are most common in your database. For example, the problem is much easier if you can assume that the rows are numbered sequentially from 1 to n. This could be enforced using a trigger whenever a record is added/deleted. This would slow insertions/deletions but speed selects.

Sure: to take a caricatured example, if database records are already decorated with random numbers, and an index is built over the decoration, random sampling may indeed be done more quickly :-). The fact is that (at least our) databases are not especially designed for random sampling, and the people in charge would resist redesigning them merely because there would be a few needs for random sampling. What would be ideal is being able to build random samples out of any big database or file, with equal ease. The fact is that it's doable. (Brian Ripley points out that R textual I/O has too much overhead for this to be usable, so one should rather say, sadly: it's doable outside R.)

Just for fun: here, sample(1, 10) in R is slowish already :-).

This is another example where greater knowledge of the problem can yield speed increases. Here (where the number of selections is much smaller than the total number of objects) you are better off generating 10 numbers with runif(10, 0, 100) and then checking that they are unique.

Of course, my remark about sample() is related to the previous discussion. If sample(N, M) were more on the O(M) side than on the O(N) side (both memory-wise and cpu-wise), it could be used for preselecting which rows of a big database to include in a random sample, building on your idea of using a set of IDs. As the sample of M records will have to be processed in memory by R anyway, computing a vector of M indices does not (or should not) increase complexity. However, sample(N, M) is likely less usable for randomly sampling a database if it is O(N) to start with. About your suggestion of using runif and later checking uniqueness, sample() could well be implemented this way, when the arguments are proper. The greater knowledge of the problem could be built right into the routine meant to solve it. sample(N, M) could even know how to take advantage of some simplified case of a reservoir sampling technique :-).

[...] a large table of randomly distributed ids [...] (with randomly generated limits) to select the appropriate number of records. [...] a table of random numbers [...] pregenerated for you, you just choose a starting and ending index. It will be slow to generate the table the first time, but then it will be fast. It will also take up quite a bit of space, but space is cheap (and time is not!)

Thanks for the explanation. In the case under consideration here (random sampling of a big file or database), I would be tempted to guess that the time required for generating pseudo-random numbers is negligible compared to the overall input/output time, so pregenerating randomized IDs might not be worth the trouble -- all the more so since, whenever the database size changes, the list of pregenerated IDs is no longer valid.
-- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
[Just one point extracted: Hadley Wickham has answered the random sample one] On Thu, 5 Jan 2006, François Pinard wrote:

[Brian Ripley] One problem with Francois Pinard's suggestion (the credit has got lost) is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random selection of lines. Once again, this chat comes out of reading someone else's problem, this is not a problem I actually have. SPSS was not randomly accessing lines, as data files could well be held on magnetic tapes, where random access is not possible in typical practice. SPSS reads (or was reading) lines sequentially from beginning to end, and the _random_ sample is built while the reading goes.

That was not my point. R's standard I/O is through connections, which allow for pushbacks, changing line endings and re-encoding character sets. That does add overhead compared to C/Fortran line-buffered reading of a file. Skipping lines you do not need will take longer than you might guess (based on some limited experience). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
FrPi == François Pinard [EMAIL PROTECTED] on Thu, 5 Jan 2006 22:41:21 -0500 writes: FrPi [Brian Ripley] I rather thought that using a DBMS was standard practice in the R community for those using large datasets: it gets discussed rather often. FrPi Indeed. (I tried RMySQL even before speaking of R to my co-workers.) Another possibility is to make use of the several DBMS interfaces already available for R. It is very easy to pull in a sample from one of those, and surely keeping such large data files as ASCII not good practice. FrPi Selecting a sample is easy. Yet, I'm not aware of any FrPi SQL device for easily selecting a _random_ sample of FrPi the records of a given table. On the other hand, I'm FrPi no SQL specialist, others might know better. FrPi We do not have a need yet for samples where I work, FrPi but if we ever need such, they will have to be random, FrPi or else, I will always fear biases. One problem with Francois Pinard's suggestion (the credit has got lost) is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R. FrPi I understand that you mean random access to lines, FrPi instead of random selection of lines. Once again, FrPi this chat comes out of reading someone else's problem, FrPi this is not a problem I actually have. SPSS was not FrPi randomly accessing lines, as data files could well be FrPi hold on magnetic tapes, where random access is not FrPi possible on average practice. SPSS reads (or was FrPi reading) lines sequentially from beginning to end, and FrPi the _random_ sample is built while the reading goes. FrPi Suppose the file (or tape) holds N records (N is not FrPi known in advance), from which we want a sample of M FrPi records at most. If N = M, then we use the whole FrPi file, no sampling is possible nor necessary. FrPi Otherwise, we first initialise M records with the FrPi first M records of the file. Then, for each record in FrPi the file after the M'th, the algorithm has to decide FrPi if the record just read will be discarded or if it FrPi will replace one of the M records already saved, and FrPi in the latter case, which of those records will be FrPi replaced. If the algorithm is carefully designed, FrPi when the last (N'th) record of the file will have been FrPi processed this way, we may then have M records FrPi randomly selected from N records, in such a a way that FrPi each of the N records had an equal probability to end FrPi up in the selection of M records. I may seek out for FrPi details if needed. FrPi This is my suggestion, or in fact, more a thought that FrPi a suggestion. It might represent something useful FrPi either for flat ASCII files or even for a stream of FrPi records coming out of a database, if those effectively FrPi do not offer ready random sampling devices. FrPi P.S. - In the (rather unlikely, I admit) case the gang FrPi I'm part of would have the need described above, and FrPi if I then dared implementing it myself, would it be welcome? I think this would be a very interesting tool and I'm also intrigued about the details of the algorithm you outline above. If it would be made to work on all kind of read.table()-readable files, (i.e. of course including *.csv); that might be a valuable tool for all those -- and there are many -- for whom working with DBMs is too daunting initially. Martin Maechler, ETH Zurich __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
On Fri, 6 Jan 2006, Martin Maechler wrote: FrPi == François Pinard [EMAIL PROTECTED] on Thu, 5 Jan 2006 22:41:21 -0500 writes: FrPi [Brian Ripley] I rather thought that using a DBMS was standard practice in the R community for those using large datasets: it gets discussed rather often. FrPi Indeed. (I tried RMySQL even before speaking of R to my co-workers.) Another possibility is to make use of the several DBMS interfaces already available for R. It is very easy to pull in a sample from one of those, and surely keeping such large data files as ASCII not good practice. FrPi Selecting a sample is easy. Yet, I'm not aware of any FrPi SQL device for easily selecting a _random_ sample of FrPi the records of a given table. On the other hand, I'm FrPi no SQL specialist, others might know better. FrPi We do not have a need yet for samples where I work, FrPi but if we ever need such, they will have to be random, FrPi or else, I will always fear biases. One problem with Francois Pinard's suggestion (the credit has got lost) is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R. FrPi I understand that you mean random access to lines, FrPi instead of random selection of lines. Once again, FrPi this chat comes out of reading someone else's problem, FrPi this is not a problem I actually have. SPSS was not FrPi randomly accessing lines, as data files could well be FrPi hold on magnetic tapes, where random access is not FrPi possible on average practice. SPSS reads (or was FrPi reading) lines sequentially from beginning to end, and FrPi the _random_ sample is built while the reading goes. FrPi Suppose the file (or tape) holds N records (N is not FrPi known in advance), from which we want a sample of M FrPi records at most. If N = M, then we use the whole FrPi file, no sampling is possible nor necessary. FrPi Otherwise, we first initialise M records with the FrPi first M records of the file. Then, for each record in FrPi the file after the M'th, the algorithm has to decide FrPi if the record just read will be discarded or if it FrPi will replace one of the M records already saved, and FrPi in the latter case, which of those records will be FrPi replaced. If the algorithm is carefully designed, FrPi when the last (N'th) record of the file will have been FrPi processed this way, we may then have M records FrPi randomly selected from N records, in such a a way that FrPi each of the N records had an equal probability to end FrPi up in the selection of M records. I may seek out for FrPi details if needed. FrPi This is my suggestion, or in fact, more a thought that FrPi a suggestion. It might represent something useful FrPi either for flat ASCII files or even for a stream of FrPi records coming out of a database, if those effectively FrPi do not offer ready random sampling devices. FrPi P.S. - In the (rather unlikely, I admit) case the gang FrPi I'm part of would have the need described above, and FrPi if I then dared implementing it myself, would it be welcome? I think this would be a very interesting tool and I'm also intrigued about the details of the algorithm you outline above. It's called `reservoir sampling' and is described in my simulation book and Knuth and elsewhere. If it would be made to work on all kind of read.table()-readable files, (i.e. of course including *.csv); that might be a valuable tool for all those -- and there are many -- for whom working with DBMs is too daunting initially. 
It would be better (for the reasons I gave) to do this in a separate file preprocessor: read.table reads from a connection not a file, of course. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
RG, Actually, SQLite provides a solution to read *.csv file directly into db. Just for your consideration. On 1/5/06, ronggui [EMAIL PROTECTED] wrote: 2006/1/6, jim holtman [EMAIL PROTECTED]: If what you are reading in is numeric data, then it would require (807 * 118519 * 8) 760MB just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring. In fact,If I read it in 3 pieces, each is about 170M. You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time? Yip,I don't need all the variables.But I don't know how to get the necessary variables into R. At last I read the data in piece and use RSQLite package to write it to a database.and do then do the analysis. If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me.I ask for help in SQLite list,but the solution is not satisfying as that required the knowledge about the third script language.After searching the internet,I get this solution: #begin rm(list=ls()) f-file(D:\wvsevs_sb_v4.csv,r) i - 0 done - FALSE library(RSQLite) con-dbConnect(SQLite,c:\sqlite\database.db3) tim1-Sys.time() while(!done){ i-i+1 tt-readLines(f,2500) if (length(tt)2500) done - TRUE tt-textConnection(tt) if (i==1) { assign(dat,read.table(tt,head=T,sep=,,quote=)); } else assign(dat,read.table(tt,head=F,sep=,,quote=)) close(tt) ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T), dbWriteTable(con,wvs,dat) ) } close(f) #end It's not the best solution,but it works. If you use 'scan', you could specify that you do not want some of the variables read in so it might make a more reasonably sized objects. On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html -- Jim Holtman Cincinnati, OH +1 513 247 0281 What is the problem you are trying to solve? -- 黄荣贵 Department of Sociology Fudan University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- WenSui Liu (http://statcompute.blogspot.com) Senior Decision Support Analyst Health Policy and Clinical Effectiveness Cincinnati Children's Hospital Medical Center [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
RG, I think .import command in sqlite should work. plus, sqlite browser ( http://sqlitebrowser.sourceforge.net) might do the work as well. On 1/6/06, ronggui [EMAIL PROTECTED] wrote: Can you give me some hints? or let me know how to do ? Thank you ! 2006/1/6, Wensui Liu [EMAIL PROTECTED]: RG, Actually, SQLite provides a solution to read *.csv file directly into db. Just for your consideration. On 1/5/06, ronggui [EMAIL PROTECTED] wrote: 2006/1/6, jim holtman [EMAIL PROTECTED]: If what you are reading in is numeric data, then it would require (807 * 118519 * 8) 760MB just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring. In fact,If I read it in 3 pieces, each is about 170M. You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time? Yip,I don't need all the variables.But I don't know how to get the necessary variables into R. At last I read the data in piece and use RSQLite package to write it to a database.and do then do the analysis. If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me.I ask for help in SQLite list,but the solution is not satisfying as that required the knowledge about the third script language.After searching the internet,I get this solution: #begin rm(list=ls()) f-file(D:\wvsevs_sb_v4.csv,r) i - 0 done - FALSE library(RSQLite) con-dbConnect(SQLite,c:\sqlite\database.db3) tim1-Sys.time() while(!done){ i-i+1 tt-readLines(f,2500) if (length(tt)2500) done - TRUE tt-textConnection(tt) if (i==1) { assign(dat,read.table(tt,head=T,sep=,,quote=)); } else assign(dat,read.table(tt,head=F,sep=,,quote=)) close(tt) ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T), dbWriteTable(con,wvs,dat) ) } close(f) #end It's not the best solution,but it works. If you use 'scan', you could specify that you do not want some of the variables read in so it might make a more reasonably sized objects. On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. 
If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Jim Holtman Cincinnati, OH +1 513 247 0281 What is the problem you are trying to solve? -- 黄荣贵 Department of Sociology Fudan University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- WenSui Liu (http://statcompute.blogspot.com) Senior Decision Support Analyst Health Policy and Clinical Effectiveness Cincinnati Children's Hospital Medical Center -- 黄荣贵 Department of Sociology Fudan University
[R] Suggestion for big files [was: Re: A comment about R:]
[ronggui] R is weak when handling large data files. I have a data file: 807 vars, 118519 obs, in CSV format. Stata can read it in in 2 minutes, but on my PC R can hardly handle it. My PC's CPU is 1.7G; RAM 512M.

Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was useful back then: the capability of sampling a big dataset directly at file read time, before processing even starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.)

One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset.

If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
-Original Message- [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. Since I often work with images and other large data sets, I have been thinking about a BLOb (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R space as was actually needed. So I see 3 possibilities: 1. The sort of functionality you describe is implemented in the R internals (by people other than me). 2. Some individuals (perhaps myself included) write such a package. 3. This thread fizzles out and we do nothing. I guess I will see what, if any, discussion ensues from this point to see which of these three options seems worth pursuing. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting- guide.html This email message, including any attachments, is for the so...{{dropped}} __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
Another possibility is to make use of the several DBMS interfaces already available for R. It is very easy to pull in a sample from one of those, and surely keeping such large data files as ASCII not good practice. One problem with Francois Pinard's suggestion (the credit has got lost) is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R. That's a deliberate design decision, given the DBMS interfaces. I rather thought that using a DBMS was standard practice in the R community for those using large datasets: it gets discussed rather often. On Thu, 5 Jan 2006, Kort, Eric wrote: -Original Message- [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. Since I often work with images and other large data sets, I have been thinking about a BLOb (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R space as was actually needed. So I see 3 possibilities: 1. The sort of functionality you describe is implemented in the R internals (by people other than me). 2. Some individuals (perhaps myself included) write such a package. 3. This thread fizzles out and we do nothing. I guess I will see what, if any, discussion ensues from this point to see which of these three options seems worth pursuing. -- François Pinard http://pinard.progiciels-bpi.ca -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
If what you are reading in is numeric data, then it would require (807 * 118519 * 8) 760MB just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring. You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time? If you use 'scan', you could specify that you do not want some of the variables read in so it might make a more reasonably sized objects. On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Jim Holtman Cincinnati, OH +1 513 247 0281 What the problem you are trying to solve? [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
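As an illustration of the 'scan'/read.table route (the column positions and file name below are invented): telling read.table which columns to skip keeps the memory footprint down to the variables actually kept.

# Read only columns 3 and 10 of a wide csv file; "NULL" drops a column at
# read time, NA lets read.table guess the type of a kept column.
cc <- rep("NULL", 807)
cc[c(3, 10)] <- NA
dat <- read.table("bigfile.csv", header = TRUE, sep = ",", colClasses = cc)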
Re: [R] Suggestion for big files [was: Re: A comment about R:]
2006/1/6, jim holtman [EMAIL PROTECTED]:

If what you are reading in is numeric data, then it would require 807 * 118519 * 8 bytes, about 760MB, just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring.

In fact, if I read it in 3 pieces, each is about 170M.

You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time?

Yip, I don't need all the variables, but I don't know how to get just the necessary variables into R. In the end I read the data in pieces and used the RSQLite package to write them to a database, and then do the analysis. If I were familiar with database software, using a database (and R) would be the best choice, but converting the file into database format is not an easy job for me. I asked for help on the SQLite list, but the solution was not satisfying, as it required knowledge of a third scripting language. After searching the internet, I got this solution:

#begin
rm(list = ls())
f <- file("D:\\wvsevs_sb_v4.csv", "r")
i <- 0
done <- FALSE
library(RSQLite)
con <- dbConnect(SQLite(), "c:\\sqlite\\database.db3")
tim1 <- Sys.time()
while (!done) {
  i <- i + 1
  tt <- readLines(f, 2500)                # read the next chunk of 2500 lines
  if (length(tt) < 2500) done <- TRUE     # a short chunk means end of file
  tt <- textConnection(tt)
  if (i == 1) {
    assign("dat", read.table(tt, head = TRUE, sep = ",", quote = ""))
  } else {
    assign("dat", read.table(tt, head = FALSE, sep = ",", quote = ""))
  }
  close(tt)
  ifelse(dbExistsTable(con, "wvs"),       # append to the table once it exists
         dbWriteTable(con, "wvs", dat, append = TRUE),
         dbWriteTable(con, "wvs", dat))
}
close(f)
#end

It's not the best solution, but it works.

If you use 'scan', you could specify that you do not want some of the variables read in so it might make more reasonably sized objects.

On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R is weak when handling large data files. I have a data file: 807 vars, 118519 obs, in CSV format. Stata can read it in in 2 minutes, but on my PC R can hardly handle it. My PC's CPU is 1.7G; RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was useful back then: the capability of sampling a big dataset directly at file read time, before processing even starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Jim Holtman Cincinnati, OH +1 513 247 0281 What is the problem you are trying to solve?
-- 黄荣贵 Department of Sociology Fudan University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Suggestion for big files [was: Re: A comment about R:]
ronggui wrote: If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me. Good working knowledge of a DBMS is almost invaluable when it comes to working with very large data sets. In addition, learning SQL is piece of cake compared to learning R. On top of that, knowledge of another (SQL) scripting language is not needed (except perhaps for special tasks): you can easily use R to generate the SQL syntax to import and work with arbitrarily wide tables. (I'm not familiar with SQLite, but MySQL comes with a command line tool that can run syntax files.) Better start learning SQL today. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of ronggui Sent: Thursday, January 05, 2006 12:48 PM To: jim holtman Cc: r-help@stat.math.ethz.ch Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:] 2006/1/6, jim holtman [EMAIL PROTECTED]: If what you are reading in is numeric data, then it would require (807 * 118519 * 8) 760MB just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring. In fact,If I read it in 3 pieces, each is about 170M. You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time? Yip,I don't need all the variables.But I don't know how to get the necessary variables into R. At last I read the data in piece and use RSQLite package to write it to a database.and do then do the analysis. If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me.I ask for help in SQLite list,but the solution is not satisfying as that required the knowledge about the third script language.After searching the internet,I get this solution: #begin rm(list=ls()) f-file(D:\wvsevs_sb_v4.csv,r) i - 0 done - FALSE library(RSQLite) con-dbConnect(SQLite,c:\sqlite\database.db3) tim1-Sys.time() while(!done){ i-i+1 tt-readLines(f,2500) if (length(tt)2500) done - TRUE tt-textConnection(tt) if (i==1) { assign(dat,read.table(tt,head=T,sep=,,quote=)); } else assign(dat,read.table(tt,head=F,sep=,,quote=)) close(tt) ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T), dbWriteTable(con,wvs,dat) ) } close(f) #end It's not the best solution,but it works. If you use 'scan', you could specify that you do not want some of the variables read in so it might make a more reasonably sized objects. On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. 
If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra argument, say), it could offer a compromise for processing large files, and also sometimes accelerate computations for big problems, even when memory is not at stake. -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Jim Holtman Cincinnati, OH +1 513 247 0281 What the problem you are trying to solve? -- 黄荣贵 Deparment of Sociology Fudan University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting
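By way of example, generating the SQL from R for an arbitrarily wide table can be as simple as pasting the column definitions together; the table name, column names and types below are invented:

# Build a CREATE TABLE statement for an 807-column table.
cols <- paste("v", seq_len(807), sep = "")
sql  <- paste("CREATE TABLE wvs (",
              paste(cols, "DOUBLE", collapse = ", "),
              ")", sep = "")
cat(sql, file = "create_wvs.sql")
# The file can then be fed to the mysql command line tool, or the statement
# sent through dbSendQuery() via one of the R DBMS interfaces.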
Re: [R] Suggestion for big files [was: Re: A comment about R:]
Rongui, I'm not familiar with SQLite, but using MySQL would solve your problem. MySQL has a LOAD DATA INFILE statement that loads text/csv files rapidly. In R, assuming a test table exists in MySQL (blank table is fine), something like this would load the data directly in MySQL. library(DBI) library(RMySQL) dbSendQuery(mycon,LOAD DATA INFILE 'C:/textfile.csv' INTO TABLE test3 FIELDS TERMINATED BY ',' ) #for csv files Then a normal SQL query would allow you to work with a manageable size of data. From: bogdan romocea [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: r-help R-help@stat.math.ethz.ch Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:] Date: Thu, 5 Jan 2006 15:26:51 -0500 ronggui wrote: If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me. Good working knowledge of a DBMS is almost invaluable when it comes to working with very large data sets. In addition, learning SQL is piece of cake compared to learning R. On top of that, knowledge of another (SQL) scripting language is not needed (except perhaps for special tasks): you can easily use R to generate the SQL syntax to import and work with arbitrarily wide tables. (I'm not familiar with SQLite, but MySQL comes with a command line tool that can run syntax files.) Better start learning SQL today. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of ronggui Sent: Thursday, January 05, 2006 12:48 PM To: jim holtman Cc: r-help@stat.math.ethz.ch Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:] 2006/1/6, jim holtman [EMAIL PROTECTED]: If what you are reading in is numeric data, then it would require (807 * 118519 * 8) 760MB just to store a single copy of the object -- more memory than you have on your computer. If you were reading it in, then the problem is the paging that was occurring. In fact,If I read it in 3 pieces, each is about 170M. You have to look at storing this in a database and working on a subset of the data. Do you really need to have all 807 variables in memory at the same time? Yip,I don't need all the variables.But I don't know how to get the necessary variables into R. At last I read the data in piece and use RSQLite package to write it to a database.and do then do the analysis. If i am familiar with database software, using database (and R) is the best choice,but convert the file into database format is not an easy job for me.I ask for help in SQLite list,but the solution is not satisfying as that required the knowledge about the third script language.After searching the internet,I get this solution: #begin rm(list=ls()) f-file(D:\wvsevs_sb_v4.csv,r) i - 0 done - FALSE library(RSQLite) con-dbConnect(SQLite,c:\sqlite\database.db3) tim1-Sys.time() while(!done){ i-i+1 tt-readLines(f,2500) if (length(tt)2500) done - TRUE tt-textConnection(tt) if (i==1) { assign(dat,read.table(tt,head=T,sep=,,quote=)); } else assign(dat,read.table(tt,head=F,sep=,,quote=)) close(tt) ifelse(dbExistsTable(con, wvs),dbWriteTable(con,wvs,dat,append=T), dbWriteTable(con,wvs,dat) ) } close(f) #end It's not the best solution,but it works. If you use 'scan', you could specify that you do not want some of the variables read in so it might make a more reasonably sized objects. On 1/5/06, François Pinard [EMAIL PROTECTED] wrote: [ronggui] R's week when handling large data file. I has a data file : 807 vars, 118519 obs.and its CVS format. 
Stata can read it in in 2 minus,but In my PC,R almost can not handle. my pc's cpu 1.7G ;RAM 512M. Just (another) thought. I used to use SPSS, many, many years ago, on CDC machines, where the CPU had limited memory and no kind of paging architecture. Files did not need to be very large for being too large. SPSS had a feature that was then useful, about the capability of sampling a big dataset directly at file read time, quite before processing starts. Maybe something similar could help in R (that is, instead of reading the whole data in memory, _then_ sampling it.) One can read records from a file, up to a preset amount of them. If the file happens to contain more records than that preset number (the number of records in the whole file is not known beforehand), already read records may be dropped at random and replaced by other records coming from the file being read. If the random selection algorithm is properly chosen, it can be made so that all records in the original file have equal probability of being kept in the final subset. If such a sampling facility was built right within usual R reading routines (triggered by an extra
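Spelled out with the string quoting a working call needs (connection details are placeholders, and the test3 table must already exist, as noted above), the LOAD DATA INFILE idea reads:

library(DBI)
library(RMySQL)
mycon <- dbConnect(MySQL(), dbname = "test")   # placeholder connection details
dbSendQuery(mycon,
  "LOAD DATA INFILE 'C:/textfile.csv' INTO TABLE test3 FIELDS TERMINATED BY ','")
dbDisconnect(mycon)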
Re: [R] Suggestion for big files [was: Re: A comment about R:]
[Brian Ripley] I rather thought that using a DBMS was standard practice in the R community for those using large datasets: it gets discussed rather often.

Indeed. (I tried RMySQL even before speaking of R to my co-workers.)

Another possibility is to make use of the several DBMS interfaces already available for R. It is very easy to pull in a sample from one of those, and surely keeping such large data files as ASCII is not good practice.

Selecting a sample is easy. Yet, I'm not aware of any SQL device for easily selecting a _random_ sample of the records of a given table. On the other hand, I'm no SQL specialist, others might know better. We do not have a need yet for samples where I work, but if we ever need such, they will have to be random, or else, I will always fear biases.

One problem with Francois Pinard's suggestion (the credit has got lost) is that R's I/O is not line-oriented but stream-oriented. So selecting lines is not particularly easy in R.

I understand that you mean random access to lines, instead of random selection of lines. Once again, this chat comes out of reading someone else's problem, this is not a problem I actually have. SPSS was not randomly accessing lines, as data files could well be held on magnetic tapes, where random access is not possible in typical practice. SPSS reads (or was reading) lines sequentially from beginning to end, and the _random_ sample is built while the reading goes.

Suppose the file (or tape) holds N records (N is not known in advance), from which we want a sample of M records at most. If N <= M, then we use the whole file, no sampling is possible nor necessary. Otherwise, we first initialise M records with the first M records of the file. Then, for each record in the file after the M'th, the algorithm has to decide if the record just read will be discarded or if it will replace one of the M records already saved, and in the latter case, which of those records will be replaced. If the algorithm is carefully designed, when the last (N'th) record of the file has been processed this way, we may then have M records randomly selected from N records, in such a way that each of the N records had an equal probability to end up in the selection of M records. I may seek out the details if needed.

This is my suggestion, or in fact, more a thought than a suggestion. It might represent something useful either for flat ASCII files or even for a stream of records coming out of a database, if those effectively do not offer ready random sampling devices.

P.S. - In the (rather unlikely, I admit) case the gang I'm part of would have the need described above, and if I then dared implementing it myself, would it be welcome? -- François Pinard http://pinard.progiciels-bpi.ca __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
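For the record, the scheme described above is what the literature calls reservoir sampling; a minimal line-by-line version in R might look like this (function and file names invented, header handling and chunked reading left aside):

# Keep at most m lines from a file of unknown length N; on return, every line
# of the file has had probability m/N of ending up in the sample.
reservoir_sample <- function(filename, m) {
  con <- file(filename, "r")
  on.exit(close(con))
  kept <- readLines(con, m)            # initialise with the first m records
  n <- length(kept)
  repeat {
    line <- readLines(con, 1)
    if (length(line) == 0) break       # end of file
    n <- n + 1
    j <- sample(n, 1)                  # record n survives with probability m/n
    if (j <= m) kept[j] <- line
  }
  kept
}
# e.g. dat <- read.table(textConnection(reservoir_sample("big.csv", 1000)), sep = ",")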
Re: [R] Suggestion for big files [was: Re: A comment about R:]
Selecting a sample is easy. Yet, I'm not aware of any SQL device for easily selecting a _random_ sample of the records of a given table. On the other hand, I'm no SQL specialist, others might know better. There are a number of such devices, which tend to be rather SQL variant specific. Try googling for select random rows mysql, select random rows pgsql, etc. Another possibility is to generate a large table of randomly distributed ids and then use that (with randomly generated limits) to select the appropriate number of records. Hadley __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html