Re: [R] Incremental ReadLines
Hi again,

Changing my code to define vectors outside the loop and combine them afterwards helped a lot, so the code no longer slows down and I was able to parse the file in less than 2 hours. Not fantastic, but it works. I will use William's last suggestion on how to parse without looping the next time I have to parse a large file.

Many thanks for your help!

Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap wdun...@tibco.com wrote:
> [quoted message elided; it appears in full later in this thread]
Re: [R] Incremental ReadLines
Hi there,

I am having a similar problem with reading in a large text file with around 550,000 observations, each with 10 to 100 lines of description. I am trying to parse it in R but I have trouble with the size of the file. It seems to slow down dramatically at some point. I would be happy for any suggestions. Here is my code, which works fine when I run it on a subsample of my dataset.

  #Defining datasource
  file <- "filename.txt"

  #Creating placeholder for data and assigning column names
  data <- data.frame(Id=NA)

  #Starting at case = 0
  case <- 0

  #Opening a connection to data
  input <- file(file, "rt")

  #Going through cases
  repeat {
      line <- readLines(input, n=1)
      if (length(line)==0) break
      if (length(grep("Id:", line)) != 0) {
          case <- case + 1
          data[case,] <- NA
          split_line <- strsplit(line, "Id:")
          data[case,1] <- as.numeric(split_line[[1]][2])
      }
  }

  #Closing connection
  close(input)

  #Saving dataframe
  write.csv(data, 'data.csv')

Kind regards,

Frederik
Re: [R] Incremental ReadLines
Date: Wed, 13 Apr 2011 10:57:58 -0700
From: frederikl...@gmail.com
To: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi there,
> I am having a similar problem with reading in a large text file with
> around 550,000 observations, each with 10 to 100 lines of description.
> I am trying to parse it in R but I have trouble with the size of the
> file. It seems to slow down dramatically at some point. I would be
> happy for any

This probably occurs when you run out of physical memory, but you can probably verify that by looking at the task manager. A readline() approach doesn't fit real well with R, which wants to hand you blocks of data so that inner loops, implemented largely in native code, can operate efficiently. The thing you want is a data structure that can use disk more effectively and hide these details from you and your algorithm. This works best if the algorithm works with the data structure to avoid lots of disk thrashing. You could imagine a read that does nothing until each item is needed, but often people want the whole file validated before processing, and lots of details come up with exception handling as you get fancy here.

Note of course that your parse output could be stored in a hash or something representing a DOM, and this could get arbitrarily large. Since it is designed for random access, this may cause lots of thrashing if it sits partially on disk. Anything you can do to make access patterns more regular, for example sorting your data, would help.

> suggestions. Here is my code, which works fine when I am doing a
> subsample of my dataset.
> [rest of quoted message elided; see the first message in this thread]
Re: [R] Incremental ReadLines
I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

  cases <- 0
  output <- numeric(cases)
  while(length(line <- readLines(input, n=1))==1) {
      cases <- cases + 1
      output[cases] <- as.numeric(line)
  }

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing

  output <- numeric(0)

with the likes of

  output <- numeric(50)

and when you are done with the loop trim down the length if it is too big:

  if (cases < length(output))
      length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size, and the above sort of code should make the time grow linearly with problem size.

Second, don't do data.frame subscripting inside your loop. Instead of

  data <- data.frame(Id=numeric(cases))
  while(...) {
      data[cases, 1] <- newValue
  }

do

  Id <- numeric(cases)
  while(...) {
      Id[cases] <- newValue
  }
  data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j and then extracts element i from that column. Since the column is the same every iteration, you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest. E.g., the code you showed does the same thing as

  idLines <- grep(value=TRUE, "Id:", readLines(file))
  data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))

You can also use an external process (perl or grep) to filter out the lines that are not of interest.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
> Sent: Wednesday, April 13, 2011 10:58 AM
> To: r-help@r-project.org
> Subject: Re: [R] Incremental ReadLines
>
> [quoted message elided; see the first message in this thread]
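For what it is worth, here is a minimal sketch of the original Id-extraction loop with both suggestions applied. The preallocation size of 600000 is an illustrative guess at "slightly bigger" than the 550,000 observations, and grepl() stands in for the length(grep(...)) != 0 test; neither detail is from the thread itself:

  input <- file("filename.txt", "rt")
  ids <- numeric(600000)          # preallocated plain vector, not a data.frame
  case <- 0
  repeat {
      line <- readLines(input, n = 1)
      if (length(line) == 0) break
      if (grepl("Id:", line)) {
          case <- case + 1
          ids[case] <- as.numeric(strsplit(line, "Id:")[[1]][2])
      }
  }
  close(input)
  length(ids) <- case             # trim the unused tail
  data <- data.frame(Id = ids)    # build the data.frame once, at the end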
Re: [R] Incremental ReadLines
Hi Mike,

Thanks for your comment. I must admit that I am very new to R, and although what you write sounds interesting, I have no idea where to start. Can you give some functions or examples showing how it can be done? I was under the impression that I had to use a loop since my blocks of observations are of varying length.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka marchy...@hotmail.com wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
Hi Bill,

Thank you so much for your suggestions. I will try to alter my code. Regarding the even shorter solution outside the loop: it looks good, but my problem is that not all observations have the same variables, so three different observations might look like this:

  Id: 1
  Var1: false
  Var2: 6
  Var3: 8

  Id: 2
  missing

  Id: 3
  Var1: true 3 4 5
  Var2: 7
  Var3: 3

I thought that to do it without looping through, my data had to be quite systematic, which it is not. I might be wrong though.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
Date: Thu, 14 Apr 2011 11:57:40 -0400
Subject: Re: [R] Incremental ReadLines
From: frederikl...@gmail.com
To: marchy...@hotmail.com
CC: r-help@r-project.org

> Hi Mike,
> Thanks for your comment. I must admit that I am very new to R and
> although it sounds interesting what you write I have no idea of where
> to start. Can you give some functions or examples where I can see how
> it can be done.

I'm not sure I have a good R answer; I was simply pointing out the likely issue, and maybe the rest belongs on an r-developer list or something. If you can determine that you are running out of physical memory, then you either need to partition something or make your accesses more regular. My favorite example from personal experience is sorting a data set prior to piping it into a C++ program, which changed the execution time substantially by avoiding VM thrashing. R either needs a swapping buffer or has an equivalent that someone else could mention.

> I was under the impression that I had to do a loop since my blocks of
> observations are of varying length.
>
> Thanks again,
> Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
[see below]

From: Frederik Lang [mailto:frederikl...@gmail.com]
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi Bill,
> Thank you so much for your suggestions. I will try to alter my code.
> Regarding the even shorter solution outside the loop: it looks good,
> but my problem is that not all observations have the same variables,
> so three different observations might look like this:
>
>   Id: 1
>   Var1: false
>   Var2: 6
>   Var3: 8
>
>   Id: 2
>   missing
>
>   Id: 3
>   Var1: true 3 4 5
>   Var2: 7
>   Var3: 3
>
> I thought that to do it without looping through, my data had to be
> quite systematic, which it is not. I might be wrong though.

Doing the simple preallocation that I describe should speed it up a lot with very little effort. It is more work to manipulate the columns one at a time instead of using data.frame subscripting, and it may not be worth it if you have lots of columns.

If you have a lot of this sort of file and feel that it will be worth the programming time to do something fancier, here is some code that reads lines of the form

  > cat(lines, sep="\n")
  Id: First
  Var1: false
  Var2: 6
  Var3: 8
  Id: Second
  Id: Last
  Var1: true
  Var3: 8

and produces a matrix with the Id's along the rows and the Var's along the columns:

  > f(lines)
         Var1    Var2 Var3
  First  "false" "6"  "8"
  Second NA      NA   NA
  Last   "true"  NA   "8"

The function f is:

  f <- function (lines) {
      # keep only lines with colons
      lines <- grep(value = TRUE, "^.+:", lines)
      lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
      isIdLine <- grepl("^Id:", lines)
      group <- cumsum(isIdLine)
      rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
      lines <- lines[!isIdLine]
      group <- group[!isIdLine]
      varname <- sub("[[:space:]]*:.*$", "", lines)
      value <- sub(".*:[[:space:]]*", "", lines)
      colnames <- unique(varname)
      col <- match(varname, colnames)
      retval <- array(NA_character_,
                      c(length(rownames), length(colnames)),
                      dimnames = list(rownames, colnames))
      retval[cbind(group, col)] <- value
      retval
  }

The main trick is the matrix subscript given to retval on the penultimate line.

> Thanks again,
> Frederik
>
> On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:
>> [quoted message elided; it appears in full earlier in this thread]

Bill
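As a minimal usage sketch (the file name is hypothetical, and the type.convert() step is an illustrative addition, not part of William's code), the character matrix from f can be turned back into a typed data.frame:

  lines <- readLines("filename.txt")    # hypothetical input file
  m <- f(lines)
  data <- as.data.frame(m, stringsAsFactors = FALSE)
  data[] <- lapply(data, type.convert, as.is = TRUE)  # "6" -> 6, "false" -> FALSE
  data <- cbind(Id = rownames(m), data)  # promote the row names to an Id column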
Re: [R] Incremental ReadLines
Gene,

You might want to look at the function read.csv.ffdf from package ff, which can read large csv files into an ffdf object. That is a kind of data.frame which is stored on disk, resp. in the file-system cache. Once you subscript part of it, you get a regular data.frame.

Jens Oehlschlägel
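A minimal sketch of how that might look (the file name and chunk sizes are illustrative assumptions; see ?read.csv.ffdf for the actual arguments):

  library(ff)
  # reads the file in chunks; the resulting ffdf lives on disk, not in RAM
  big <- read.csv.ffdf(file = "thefile.csv", header = TRUE,
                       first.rows = 10000, next.rows = 50000)
  chunk <- big[1:121, ]  # subscripting returns an ordinary in-memory data.frame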
Re: [R] Incremental ReadLines
If the headers all start with the same letter, A say, and the data lines contain only numbers, then just use

  read.table(..., comment.char = "A")

On Mon, Nov 2, 2009 at 2:03 PM, Gene Leynes gleyne...@gmail.com wrote:
> I've been trying to figure out how to read in a large file for a few
> days now, and after extensive research I'm still not sure what to do.
>
> I have a large comma delimited text file that contains 59 fields in
> each record. There is also a header every 121 records.
>
> This function works well for smallish records:
>
>   getcsv <- function(fname){
>       ff <- file(description = fname)
>       x <- readLines(ff)
>       closeAllConnections()
>       x <- x[x != ""]             # REMOVE BLANKS
>       x <- x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT
>       spl <- strsplit(x, ',')     # THIS PART IS SLOW, BUT MANAGEABLE
>       xx <- t(sapply(1:length(spl),
>                      function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
>       return(xx)
>   }
>
> It's not elegant, but it works.
> For 121,000 records it completes in 2.3 seconds
> For 121,000*5 records it completes in 63 seconds
> For 121,000*10 records it doesn't complete
>
> When I try other methods to read the file in chunks (using scan), the
> process breaks down because I have to start at the beginning of the
> file on every iteration. For example:
>
>   fnn <- function(n, col){
>       a <- 122*(n-1) + 2
>       xx <- scan(fname, skip=a-1, nlines=121, sep=',',
>                  quiet=TRUE, what=character(0))
>       xx <- xx[xx != '']
>       xx <- matrix(xx, ncol=49, byrow=TRUE)
>       xx[, col]
>   }
>   system.time(sapply(1:10, fnn, c=26))      # 0.31 seconds
>   system.time(sapply(91:90, fnn, c=26))     # 1.09 seconds
>   system.time(sapply(901:910, fnn, c=26))   # 5.78 seconds
>
> Even though I'm only getting the 26th column for 10 sets of records,
> it takes a lot longer the further into the file I go. How can I tell
> scan to pick up where it left off, without it starting at the
> beginning? There must be a good example somewhere.
>
> I have done a lot of research (in fact, thank you to Michael J.
> Crawley and others for your help thus far)
>
> Thanks,
> Gene
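A minimal sketch of that suggestion applied to the file described above, assuming every repeated header line starts with the letter "A" and all data rows are numeric (the file name is illustrative):

  # lines beginning with "A" are treated as comments and skipped entirely;
  # blank lines are skipped by default (blank.lines.skip = TRUE)
  dat <- read.table("thefile.csv", sep = ",", comment.char = "A")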
Re: [R] Incremental ReadLines
On 11/2/2009 2:03 PM, Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few
> days now, and after extensive research I'm still not sure what to do.
>
> I have a large comma delimited text file that contains 59 fields in
> each record. There is also a header every 121 records.

You can open the connection before reading, then read in blocks of lines and process those. You don't need to reopen it every time. For example,

  ff <- file(fname, open="rt")   # "rt" is "read text"
  for (block in 1:nblocks) {
      x <- readLines(ff, n=121)
      # process this block
  }
  close(ff)

Duncan Murdoch

> [rest of quoted message elided; see Gene's question earlier in this thread]
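When the number of blocks is not known in advance, a variant of the same idea (a sketch, not from Duncan's message) reads until the connection is exhausted:

  ff <- file(fname, open = "rt")
  repeat {
      x <- readLines(ff, n = 121)
      if (length(x) == 0) break   # end of file
      # process this block of up to 121 lines
  }
  close(ff)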
Re: [R] Incremental ReadLines
Hi Gene,

Rather than using R to parse this file, have you considered using either grep or sed to pre-process the file and then read it in? It looks like you just want the lines starting with numbers, so something like

  grep '^[0-9]\+' thefile.csv > otherfile.csv

should be much faster, and then you can just read in otherfile.csv using read.csv().

Best,

Jim

Gene Leynes wrote:
> [quoted message elided; see Gene's question earlier in this thread]

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
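On a system where grep is available, the same filtering can also be done without a temporary file by reading from a pipe; a minimal sketch (the file name is illustrative):

  # read.csv consumes grep's output directly; no otherfile.csv needed
  dat <- read.csv(pipe("grep '^[0-9]' thefile.csv"), header = FALSE)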
Re: [R] Incremental ReadLines
James,

I think those are Unix commands? I'm on Windows, so that's not an option (for now).

Also, the suggestions posed by Duncan and Phil seem to be working. Thank you so much; such a simple thing to add the "r" or "rt" to the file connection. I read about blocking, but I didn't imagine that it meant chunks. I was thinking something more like blocking out, or guarding (perhaps for security).

On Mon, Nov 2, 2009 at 1:47 PM, James W. MacDonald jmac...@med.umich.edu wrote:
> [quoted message elided; it appears in full earlier in this thread]