Re: [R] Incremental ReadLines
Hi again,

Changing my code to define vectors outside the loop and combine them afterwards helped a lot, so the code no longer slows down and I was able to parse the file in less than 2 hours. Not fantastic, but it works. I will use William's last suggestion on how to parse without looping the next time I have to parse a large file.

Many thanks for your help!

Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap wdun...@tibco.com wrote:
> [quoted message elided; it appears in full later in this thread]
Re: [R] Incremental ReadLines
Hi there,

I am having a similar problem with reading in a large text file with around 550,000 observations, each with 10 to 100 lines of description. I am trying to parse it in R but I have trouble with the size of the file. It seems to slow down dramatically at some point. I would be happy for any suggestions. Here is my code, which works fine when I run it on a subsample of my dataset.

  #Defining datasource
  file <- "filename.txt"

  #Creating placeholder for data and assigning column names
  data <- data.frame(Id=NA)

  #Starting at case = 0
  case <- 0

  #Opening a connection to data
  input <- file(file, "rt")

  #Going through cases
  repeat {
      line <- readLines(input, n=1)
      if (length(line)==0) break
      if (length(grep("Id:", line)) != 0) {
          case <- case + 1
          data[case,] <- NA
          split_line <- strsplit(line, "Id:")
          data[case,1] <- as.numeric(split_line[[1]][2])
      }
  }

  #Closing connection
  close(input)

  #Saving dataframe
  write.csv(data, 'data.csv')

Kind regards,

Frederik
Re: [R] Incremental ReadLines
Date: Wed, 13 Apr 2011 10:57:58 -0700
From: frederikl...@gmail.com
To: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi there,
> I am having a similar problem with reading in a large text file with
> around 550,000 observations, each with 10 to 100 lines of description.
> I am trying to parse it in R but I have trouble with the size of the
> file. It seems to slow down dramatically at some point. I would be
> happy for any

This probably occurs when you run out of physical memory, but you can probably verify that by looking at the task manager. A readline() approach doesn't fit real well with R, which wants to hand you blocks of data so that inner loops, implemented largely in native code, can operate efficiently. The thing you want is a data structure that can use disk more effectively and hide these details from you and your algorithm. This works best if the algorithm works with the data structure to avoid lots of disk thrashing. You could imagine a read that does nothing until each item is needed, but often people want the whole file validated before processing, and lots of details come up with exception handling as you get fancy here.

Note of course that your parse output could be stored in a hash or something representing a DOM, and this could get arbitrarily large. Since it is designed for random access, this may cause lots of thrashing if it sits partially on disk. Anything you can do to make access patterns more regular, for example sorting your data, would help.

> suggestions. Here is my code, which works fine when I am doing a
> subsample of my dataset.
> [rest of quoted message elided; see the first message in this thread]
Re: [R] Incremental ReadLines
I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

  cases <- 0
  output <- numeric(cases)
  while(length(line <- readLines(input, n=1))==1) {
      cases <- cases + 1
      output[cases] <- as.numeric(line)
  }

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing

  output <- numeric(0)

with the likes of

  output <- numeric(50)

and when you are done with the loop trim down the length if it is too big:

  if (cases < length(output))
      length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size, and the above sort of code should make the time grow linearly with problem size.

Second, don't do data.frame subscripting inside your loop. Instead of

  data <- data.frame(Id=numeric(cases))
  while(...) {
      data[cases, 1] <- newValue
  }

do

  Id <- numeric(cases)
  while(...) {
      Id[cases] <- newValue
  }
  data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j and then extracts element i from that column. Since the column is the same every iteration, you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest. E.g., the code you showed does the same thing as

  idLines <- grep(value=TRUE, "Id:", readLines(file))
  data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))

You can also use an external process (perl or grep) to filter out the lines that are not of interest.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
> Sent: Wednesday, April 13, 2011 10:58 AM
> To: r-help@r-project.org
> Subject: Re: [R] Incremental ReadLines
>
> [quoted message elided; see the first message in this thread]
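For what it is worth, here is a minimal sketch of the original Id-extraction loop with both suggestions applied. The preallocation size of 600000 is an illustrative guess at "slightly bigger" than the 550,000 observations, and grepl() stands in for the length(grep(...)) != 0 test; neither detail is from the thread itself:

  input <- file("filename.txt", "rt")
  ids <- numeric(600000)          # preallocated plain vector, not a data.frame
  case <- 0
  repeat {
      line <- readLines(input, n = 1)
      if (length(line) == 0) break
      if (grepl("Id:", line)) {
          case <- case + 1
          ids[case] <- as.numeric(strsplit(line, "Id:")[[1]][2])
      }
  }
  close(input)
  length(ids) <- case             # trim the unused tail
  data <- data.frame(Id = ids)    # build the data.frame once, at the end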
Re: [R] Incremental ReadLines
Hi Mike,

Thanks for your comment. I must admit that I am very new to R, and although what you write sounds interesting, I have no idea where to start. Can you give some functions or examples showing how it can be done? I was under the impression that I had to use a loop since my blocks of observations are of varying length.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka marchy...@hotmail.com wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
Hi Bill,

Thank you so much for your suggestions. I will try to alter my code. Regarding the even shorter solution outside the loop: it looks good, but my problem is that not all observations have the same variables, so three different observations might look like this:

  Id: 1
  Var1: false
  Var2: 6
  Var3: 8

  Id: 2
  missing

  Id: 3
  Var1: true 3 4 5
  Var2: 7
  Var3: 3

I thought that to do it without looping through, my data had to be quite systematic, which it is not. I might be wrong though.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
Date: Thu, 14 Apr 2011 11:57:40 -0400
Subject: Re: [R] Incremental ReadLines
From: frederikl...@gmail.com
To: marchy...@hotmail.com
CC: r-help@r-project.org

> Hi Mike,
> Thanks for your comment. I must admit that I am very new to R and
> although it sounds interesting what you write I have no idea of where
> to start. Can you give some functions or examples where I can see how
> it can be done.

I'm not sure I have a good R answer; I was simply pointing out the likely issue, and maybe the rest belongs on an r-developer list or something. If you can determine that you are running out of physical memory, then you either need to partition something or make your accesses more regular. My favorite example from personal experience is sorting a data set prior to piping it into a C++ program, which changed the execution time substantially by avoiding VM thrashing. R either needs a swapping buffer or has an equivalent that someone else could mention.

> I was under the impression that I had to do a loop since my blocks of
> observations are of varying length.
>
> Thanks again,
> Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka wrote:
> [quoted message elided; it appears in full earlier in this thread]
Re: [R] Incremental ReadLines
[see below]

From: Frederik Lang [mailto:frederikl...@gmail.com]
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi Bill,
> Thank you so much for your suggestions. I will try to alter my code.
> Regarding the even shorter solution outside the loop: it looks good,
> but my problem is that not all observations have the same variables,
> so three different observations might look like this:
>
>   Id: 1
>   Var1: false
>   Var2: 6
>   Var3: 8
>
>   Id: 2
>   missing
>
>   Id: 3
>   Var1: true 3 4 5
>   Var2: 7
>   Var3: 3
>
> I thought that to do it without looping through, my data had to be
> quite systematic, which it is not. I might be wrong though.

Doing the simple preallocation that I describe should speed it up a lot with very little effort. It is more work to manipulate the columns one at a time instead of using data.frame subscripting, and it may not be worth it if you have lots of columns.

If you have a lot of this sort of file and feel that it will be worth the programming time to do something fancier, here is some code that reads lines of the form

  > cat(lines, sep="\n")
  Id: First
  Var1: false
  Var2: 6
  Var3: 8
  Id: Second
  Id: Last
  Var1: true
  Var3: 8

and produces a matrix with the Id's along the rows and the Var's along the columns:

  > f(lines)
         Var1    Var2 Var3
  First  "false" "6"  "8"
  Second NA      NA   NA
  Last   "true"  NA   "8"

The function f is:

  f <- function (lines) {
      # keep only lines with colons
      lines <- grep(value = TRUE, "^.+:", lines)
      lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
      isIdLine <- grepl("^Id:", lines)
      group <- cumsum(isIdLine)
      rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
      lines <- lines[!isIdLine]
      group <- group[!isIdLine]
      varname <- sub("[[:space:]]*:.*$", "", lines)
      value <- sub(".*:[[:space:]]*", "", lines)
      colnames <- unique(varname)
      col <- match(varname, colnames)
      retval <- array(NA_character_,
                      c(length(rownames), length(colnames)),
                      dimnames = list(rownames, colnames))
      retval[cbind(group, col)] <- value
      retval
  }

The main trick is the matrix subscript given to retval on the penultimate line.

> Thanks again,
> Frederik
>
> On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:
>> [quoted message elided; it appears in full earlier in this thread]

Bill
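As a minimal usage sketch (the file name is hypothetical, and the type.convert() step is an illustrative addition, not part of William's code), the character matrix from f can be turned back into a typed data.frame:

  lines <- readLines("filename.txt")    # hypothetical input file
  m <- f(lines)
  data <- as.data.frame(m, stringsAsFactors = FALSE)
  data[] <- lapply(data, type.convert, as.is = TRUE)  # "6" -> 6, "false" -> FALSE
  data <- cbind(Id = rownames(m), data)  # promote the row names to an Id column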
Re: [R] Incremental ReadLines
Gene,

You might want to look at the function read.csv.ffdf from package ff, which can read large csv files into an ffdf object. That is a kind of data.frame which is stored on disk, resp. in the file-system cache. Once you subscript part of it, you get a regular data.frame.

Jens Oehlschlägel
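A minimal sketch of how that might look (the file name and chunk sizes are illustrative assumptions; see ?read.csv.ffdf for the actual arguments):

  library(ff)
  # reads the file in chunks; the resulting ffdf lives on disk, not in RAM
  big <- read.csv.ffdf(file = "thefile.csv", header = TRUE,
                       first.rows = 10000, next.rows = 50000)
  chunk <- big[1:121, ]  # subscripting returns an ordinary in-memory data.frame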
Re: [R] Incremental ReadLines
If the headers all start with the same letter, A say, and the data lines contain only numbers, then just use

  read.table(..., comment.char = "A")

On Mon, Nov 2, 2009 at 2:03 PM, Gene Leynes gleyne...@gmail.com wrote:
> I've been trying to figure out how to read in a large file for a few
> days now, and after extensive research I'm still not sure what to do.
>
> I have a large comma delimited text file that contains 59 fields in
> each record. There is also a header every 121 records.
>
> This function works well for smallish records:
>
>   getcsv <- function(fname){
>       ff <- file(description = fname)
>       x <- readLines(ff)
>       closeAllConnections()
>       x <- x[x != ""]             # REMOVE BLANKS
>       x <- x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT
>       spl <- strsplit(x, ',')     # THIS PART IS SLOW, BUT MANAGEABLE
>       xx <- t(sapply(1:length(spl),
>                      function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
>       return(xx)
>   }
>
> It's not elegant, but it works.
> For 121,000 records it completes in 2.3 seconds
> For 121,000*5 records it completes in 63 seconds
> For 121,000*10 records it doesn't complete
>
> When I try other methods to read the file in chunks (using scan), the
> process breaks down because I have to start at the beginning of the
> file on every iteration. For example:
>
>   fnn <- function(n, col){
>       a <- 122*(n-1) + 2
>       xx <- scan(fname, skip=a-1, nlines=121, sep=',',
>                  quiet=TRUE, what=character(0))
>       xx <- xx[xx != '']
>       xx <- matrix(xx, ncol=49, byrow=TRUE)
>       xx[, col]
>   }
>   system.time(sapply(1:10, fnn, c=26))      # 0.31 seconds
>   system.time(sapply(91:90, fnn, c=26))     # 1.09 seconds
>   system.time(sapply(901:910, fnn, c=26))   # 5.78 seconds
>
> Even though I'm only getting the 26th column for 10 sets of records,
> it takes a lot longer the further into the file I go. How can I tell
> scan to pick up where it left off, without it starting at the
> beginning? There must be a good example somewhere.
>
> I have done a lot of research (in fact, thank you to Michael J.
> Crawley and others for your help thus far)
>
> Thanks,
> Gene
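A minimal sketch of that suggestion applied to the file described above, assuming every repeated header line starts with the letter "A" and all data rows are numeric (the file name is illustrative):

  # lines beginning with "A" are treated as comments and skipped entirely;
  # blank lines are skipped by default (blank.lines.skip = TRUE)
  dat <- read.table("thefile.csv", sep = ",", comment.char = "A")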
Re: [R] Incremental ReadLines
On 11/2/2009 2:03 PM, Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few
> days now, and after extensive research I'm still not sure what to do.
>
> I have a large comma delimited text file that contains 59 fields in
> each record. There is also a header every 121 records.

You can open the connection before reading, then read in blocks of lines and process those. You don't need to reopen it every time. For example,

  ff <- file(fname, open="rt")   # "rt" is "read text"
  for (block in 1:nblocks) {
      x <- readLines(ff, n=121)
      # process this block
  }
  close(ff)

Duncan Murdoch

> [rest of quoted message elided; see Gene's question earlier in this thread]
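When the number of blocks is not known in advance, a variant of the same idea (a sketch, not from Duncan's message) reads until the connection is exhausted:

  ff <- file(fname, open = "rt")
  repeat {
      x <- readLines(ff, n = 121)
      if (length(x) == 0) break   # end of file
      # process this block of up to 121 lines
  }
  close(ff)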
Re: [R] Incremental ReadLines
Hi Gene,

Rather than using R to parse this file, have you considered using either grep or sed to pre-process the file and then read it in? It looks like you just want the lines starting with numbers, so something like

  grep '^[0-9]\+' thefile.csv > otherfile.csv

should be much faster, and then you can just read in otherfile.csv using read.csv().

Best,

Jim

Gene Leynes wrote:
> [quoted message elided; see Gene's question earlier in this thread]

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
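On a system where grep is available, the same filtering can also be done without a temporary file by reading from a pipe; a minimal sketch (the file name is illustrative):

  # read.csv consumes grep's output directly; no otherfile.csv needed
  dat <- read.csv(pipe("grep '^[0-9]' thefile.csv"), header = FALSE)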
Re: [R] Incremental ReadLines
James,

I think those are Unix commands? I'm on Windows, so that's not an option (for now).

Also, the suggestions posed by Duncan and Phil seem to be working. Thank you so much; such a simple thing to add the "r" or "rt" to the file connection. I read about blocking, but I didn't imagine that it meant chunks. I was thinking something more like blocking out, or guarding (perhaps for security).

On Mon, Nov 2, 2009 at 1:47 PM, James W. MacDonald jmac...@med.umich.edu wrote:
> [quoted message elided; it appears in full earlier in this thread]