Re: [R] Value Lookup from File without Slurping
On Fri, 2009-01-16 at 18:02 +0900, Gundala Viswanath wrote:

> Dear all,
>
> I have a repository file (let's call it repo.txt) that contains two
> columns like this:
>
> # tag value
> AAA 0.2
> AAT 0.3
> AAC 0.02
> AAG 0.02
> ATA 0.3
> ATT 0.7
>
> Given another query vector
>
> qr <- c("AAC", "ATT")
>
> I would like to find the corresponding value for each query above,
> yielding:
>
> 0.02
> 0.7
>
> However, I want to avoid slurping the whole repo.txt into an object
> (e.g. a hash). Is there any way to do that?
>
> The reason is that repo.txt is very large (millions of lines, with
> tag length 30 bp), and my PC memory is too small to hold it.
>
> - Gundala Viswanath
> Jakarta - Indonesia

Hello,

You can always store your repo.txt in a database, say SQLite, and select only the values you want via an SQL query. That way you avoid loading the full file into memory.

Best regards,
Carlos J. Gil Bellosta
http://www.datanalytics.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Value Lookup from File without Slurping
you might try to iteratively read a limited number of lines in a batch using readLines:

    # filename, the name of your file
    # n, the maximal count of lines to read in a batch
    connection = file(filename, open="rt")
    while (length(lines <- readLines(con=connection, n=n))) {
        # do your stuff here
    }
    close(connection)

?file
?readLines

vQ
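The batch-reading loop above can be fleshed out into a complete lookup. What follows is only a minimal sketch, not the poster's code: the function name lookup_chunked, the whitespace separator, and the treatment of lines starting with # as comments are all assumptions based on the sample file in the original question.

```r
# Look up values for 'tags' in a two-column text file, holding at most
# 'n' lines in memory at any time.
lookup_chunked <- function(path, tags, n = 10000L) {
  wanted <- unique(tags)
  found <- setNames(rep(NA_real_, length(wanted)), wanted)
  con <- file(path, open = "rt")
  on.exit(close(con))
  while (length(lines <- readLines(con, n = n)) > 0) {
    # Assumption: blank lines and lines starting with '#' carry no data.
    lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
    if (length(lines) == 0) next
    # Assumption: columns are separated by one or more whitespace characters.
    fields <- strsplit(trimws(lines), "[[:space:]]+")
    keys <- vapply(fields, `[`, character(1), 1L)
    hit <- keys %in% wanted
    if (any(hit)) {
      found[keys[hit]] <- as.numeric(vapply(fields[hit], `[`, character(1), 2L))
    }
  }
  # Return values in the order of the query; tags not found stay NA.
  unname(found[tags])
}

# Small stand-in for repo.txt, written to a temporary file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("# tag value", "AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), demo_path)
result <- lookup_chunked(demo_path, c("AAC", "ATT"), n = 2L)
print(result)
```

Only the current batch and the small result vector live in R, so memory use is governed by n rather than by the size of the file. A larger n generally means fewer calls to readLines and a faster scan.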
Re: [R] Value Lookup from File without Slurping
The sqldf package can read a large file into a database without going through R and then extract the portion you want. The package makes it easier to use RSQLite by setting up the database for you and, once the portion you want has been extracted, removing the database automatically. You can specify all this in two lines: one to name the file and one to specify the extraction using SQL. See example 6 on the home page:

http://sqldf.googecode.com#Example_6._File_Input
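As a rough illustration of the "two lines" described above, here is a sketch modelled on the file-input pattern of sqldf. The demo file, its plain header line, and the single-space separator are assumptions made for this example; the real repo.txt may need different file.format settings.

```r
library(sqldf)

# Small stand-in for repo.txt (assumed layout: header line, space-separated).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("tag value", "AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), demo_path)

# Line 1: name the file. The variable name doubles as the SQL table name.
repo <- file(demo_path)

# Line 2: state the extraction. dbname = tempfile() asks for a temporary
# on-disk database, so the full file is not held in R.
hits <- sqldf("select tag, value from repo where tag in ('AAC', 'ATT')",
              dbname = tempfile(),
              file.format = list(header = TRUE, sep = " ", row.names = FALSE))
print(hits)
```

Only the rows matched by the where clause come back as a data frame; everything else stays in the temporary database.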
Re: [R] Value Lookup from File without Slurping
Something like this should work:

    library(R.utils)
    out = numeric()
    qr = c("AAC", "ATT")
    n = countLines("test.txt")
    file = file("test.txt", "r")
    for (i in 1:n) {
        line = readLines(file, n = 1)
        # the separator is assumed to be a single space
        A = strsplit(line, split = " ")[[1]][1]
        if (is.element(A, qr)) {
            value = as.numeric(strsplit(line, split = " ")[[1]][2])
            out = c(out, value)
        }
    }
    close(file)

You may want to improve execution speed by reading the data in chunks instead of line by line. The code requires only a little modification.
Re: [R] Value Lookup from File without Slurping
if the file is really large, reading it twice may add considerable penalty:

r...@quantide.com wrote:
> Something like this should work
>
>     library(R.utils)
>     out = numeric()
>     qr = c("AAC", "ATT")
>     n = countLines("test.txt")      # 1st pass
>     file = file("test.txt", "r")
>     for (i in 1:n) {                # 2nd pass
>         line = readLines(file, n = 1)
>         # the separator is assumed to be a single space
>         A = strsplit(line, split = " ")[[1]][1]
>         if (is.element(A, qr)) {
>             value = as.numeric(strsplit(line, split = " ")[[1]][2])
>             out = c(out, value)
>         }
>     }

if this is a one-go task, counting the lines does not pay, and why bother. if this is a repetitive task, a database-based solution will probably be a better idea.

vQ
Re: [R] Value Lookup from File without Slurping
I agree on the database solution. Databases are the right tool for this kind of problem. Only consider the start-up cost of setting up the database: this could be a very time-consuming task for someone who is not familiar with database technology.

Using file() does not actually read the whole file. This function simply opens a connection to the file without reading it. countLines should do something like wc -l from a bash shell.

I would say that if this is a one-time job, this solution should work even though it is not the fastest. If the job is a repetitive one, then a database solution is surely better.

A.
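For the repetitive case, the set-up cost mentioned above is paid once and every later lookup is cheap. The following is a minimal sketch using the DBI and RSQLite packages; the function names build_repo_db and lookup_db, the table and index names, and the whitespace-separated layout with # comment lines are assumptions for illustration, not something given in the thread.

```r
library(DBI)

# One-off step: load the text file into an on-disk SQLite table in batches,
# so that only 'batch_lines' lines are in R's memory at a time.
build_repo_db <- function(txt_path, db_path, batch_lines = 100000L) {
  con_db <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con_db), add = TRUE)
  con_txt <- file(txt_path, open = "rt")
  on.exit(close(con_txt), add = TRUE)
  dbExecute(con_db, "CREATE TABLE IF NOT EXISTS repo (tag TEXT, value REAL)")
  while (length(lines <- readLines(con_txt, n = batch_lines)) > 0) {
    # Assumption: blank lines and lines starting with '#' carry no data.
    lines <- lines[nzchar(lines) & !grepl("^#", lines)]
    if (length(lines) == 0) next
    batch <- read.table(text = lines, col.names = c("tag", "value"),
                        colClasses = c("character", "numeric"))
    dbWriteTable(con_db, "repo", batch, append = TRUE)
  }
  # The index is what makes repeated lookups fast.
  dbExecute(con_db, "CREATE INDEX IF NOT EXISTS repo_tag ON repo (tag)")
  invisible(db_path)
}

# Repeated step: fetch only the requested tags, in query order.
lookup_db <- function(db_path, tags) {
  con_db <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con_db), add = TRUE)
  placeholders <- paste(rep("?", length(tags)), collapse = ", ")
  sql <- sprintf("SELECT tag, value FROM repo WHERE tag IN (%s)", placeholders)
  hits <- dbGetQuery(con_db, sql, params = as.list(tags))
  hits$value[match(tags, hits$tag)]
}

# Small stand-in for repo.txt.
demo_txt <- tempfile(fileext = ".txt")
writeLines(c("# tag value", "AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), demo_txt)
demo_db <- tempfile(fileext = ".sqlite")
build_repo_db(demo_txt, demo_db, batch_lines = 4L)
db_result <- lookup_db(demo_db, c("AAC", "ATT"))
print(db_result)
```

The database file persists between R sessions, so build_repo_db needs to run only when repo.txt changes.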
Re: [R] Value Lookup from File without Slurping
r...@quantide.com wrote:
> Using file() is not a real reading of all the file. This function will simply open a connection to the file without reading it. countLines should do something like wc -l from a bash shell

just for a test:

    cat(rep('', 10^7), file='test.txt', fill=1)
    library(R.utils)
    system.time(countLines('test.txt'))

... and the file is just about 30MB (and it makes no real difference if it is stuffed with newlines or not).

vQ
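The point of the timing test is that a line count cannot be obtained without a full pass over the file. The sketch below makes that pass explicit in base R by counting newline bytes block by block. It is an illustration only and is not how R.utils implements countLines; the function name count_newlines and the block size are made up for the example.

```r
# Count newline bytes by scanning the file in fixed-size binary blocks.
# Every byte of the file is read exactly once.
count_newlines <- function(path, block_bytes = 1048576L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  newline <- as.raw(10L)
  total <- 0
  repeat {
    block <- readBin(con, what = "raw", n = block_bytes)
    if (length(block) == 0) break
    total <- total + sum(block == newline)
  }
  total
}

# Small demonstration on a temporary file with three lines.
count_demo <- tempfile(fileext = ".txt")
writeLines(c("AAA 0.2", "AAT 0.3", "AAC 0.02"), count_demo)
line_count <- count_newlines(count_demo)
print(line_count)
```

The running time grows with the file size, which is why counting the lines first roughly doubles the I/O of a lookup that then scans the file again.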
Re: [R] Value Lookup from File without Slurping
r...@quantide.com wrote:
> I agree on the database solution. Databases are the right tool to solve this kind of problem. Only consider the start-up cost of setting up the database. This could be a very time-consuming task if someone is not familiar with database technology.

and won't pay if you want to do the lookup just once.

> Using file() is not a real reading of all the file. This function will simply open a connection to the file without reading it. countLines should do something like wc -l from a bash shell

... and wc knows the count of lines in a file without reading it

vQ
Re: [R] Value Lookup from File without Slurping
On Fri, Jan 16, 2009 at 5:52 AM, r...@quantide.com wrote:
> I agree on the database solution. Databases are the right tool to solve this kind of problem. Only consider the start-up cost of setting up the database. This could be a very time-consuming task if someone is not familiar with database technology.

Using sqldf, as mentioned previously on this thread, allows one to use the SQLite database with no setup at all. sqldf automatically creates the database, generates the record layout, loads the file (outside of R, so R does not slow it down), extracts the portion you want into R by issuing the appropriate calls to RSQLite/DBI, and destroys the database afterwards. When you install sqldf it automatically installs RSQLite and the SQLite database itself, so the entire installation is just one line:

    install.packages("sqldf")
Re: [R] Value Lookup from File without Slurping
Hi Gabor,

Do you mean that storing data in 'sqldf' doesn't take memory?

For example, I have a 3GB data file. With a standard R object created by read.table(), the object size will double to ~6GB. My current 4GB of RAM cannot handle that.

Do you mean that with sqldf this is not an issue? Why is that?

Sorry for my naive question.

- Gundala Viswanath
Jakarta - Indonesia
Re: [R] Value Lookup from File without Slurping
Only the portion you extract is ever in R -- the file itself is read into a database without ever going through R, so your memory requirements correspond to what you extract, not the size of the file.
Re: [R] Value Lookup from File without Slurping
Hi,

> Unless you specify an in-memory database the database is stored on disk.

Thanks for your explanation. I just downloaded 'sqldf'. Where can I find the option for that? In sqldf I can't see the option. I looked at

    envir = parent.frame()

but it doesn't appear to be the one.

- Gundala Viswanath
Jakarta - Indonesia

On Fri, Jan 16, 2009 at 10:59 AM, Gundala Viswanath gunda...@gmail.com wrote:
> Hi Gabor,
>
>> the file itself is read into a database
>
> The above doesn't use RAM memory?
>
> Rgds,
> GV.
Re: [R] Value Lookup from File without Slurping
If that refers to using a database on disk to temporarily hold the file, then example 6 on the home page shows it, as mentioned. You may wish to look at the other examples there too, and there is further documentation in the ?sqldf help file.
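To make the on-disk option concrete: the argument that appears to control where the temporary database lives is dbname, which both sqldf() and read.csv.sql() accept. The sketch below passes it explicitly to read.csv.sql; the demo file, its header line, and the space separator are assumptions for the example, and ?sqldf remains the authoritative description of the defaults.

```r
library(sqldf)

# Small stand-in for repo.txt (assumed layout: header line, space-separated).
disk_demo <- tempfile(fileext = ".txt")
writeLines(c("tag value", "AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), disk_demo)

# Inside read.csv.sql the table is always called 'file'.
# dbname = tempfile() requests a temporary database file on disk.
hits_disk <- read.csv.sql(disk_demo,
                          sql = "select tag, value from file where tag in ('AAC', 'ATT')",
                          header = TRUE, sep = " ",
                          dbname = tempfile())
print(hits_disk)
```

With the database on disk, RAM is needed only for the returned rows, which is what makes the approach workable for a file larger than memory.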