Re: [R] Creating a custom connection to read from multiple files

2005-01-21 Thread Tomas Kalibera
Hello Andy,
thanks for your examples. I rewrote everything to use matrices and
lapply/sapply and rbind calls instead of for loops and appends, and it
really helped. Reading the files one by one and combining them in R is now
even faster than concatenating them on disk: that 8 MB table is read in
3.5 seconds.
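
For readers of the archive, here is a minimal sketch of that approach, assuming purely numeric, whitespace-separated files with a known number of columns. The helper name read_numeric_files and its arguments are made up for illustration and are not from the thread; the skip argument covers the earlier wish to drop the first n lines of each file.

```r
## Hypothetical helper (not from the thread): read several numeric files
## into matrices and stack them with a single rbind call.
read_numeric_files <- function(paths, ncol, skip = 0) {
  mats <- lapply(paths, function(p) {
    ## scan() returns a plain numeric vector; skip drops leading lines.
    vals <- scan(p, skip = skip, quiet = TRUE)
    matrix(vals, ncol = ncol, byrow = TRUE)
  })
  ## One rbind call over the whole list, not one call per file.
  do.call(rbind, mats)
}

## Usage with two small temporary files, each starting with a header line.
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines(c("a b c", "1 2 3", "4 5 6"), paths[1])
writeLines(c("a b c", "7 8 9"), paths[2])
dat <- read_numeric_files(paths, ncol = 3, skip = 1)
dim(dat)
unlink(paths)
```

Keeping the pieces in a list and binding once avoids re-copying the accumulated rows for every file.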

Tomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


[R] Creating a custom connection to read from multiple files

2005-01-20 Thread Tomas Kalibera
Hello,
is it possible to create my own connection which I could use with
read.table or scan ? I would like to create a connection that would read
from multiple files in sequence (like if they were concatenated),
possibly with an option to skip first n lines of each file. I would like
to avoid using platform specific scripts for that... (currently I invoke
/bin/cat from R to create a concatenation of all those files).
Thanks,
Tomas


Re: [R] Creating a custom connection to read from multiple files

2005-01-20 Thread Prof Brian Ripley
On Thu, 20 Jan 2005, Tomas Kalibera wrote:

> is it possible to create my own connection which I could use with

Yes.  In a sense, all the connections are custom connections written by
someone.

> read.table or scan ? I would like to create a connection that would read
> from multiple files in sequence (like if they were concatenated),
> possibly with an option to skip first n lines of each file. I would like
> to avoid using platform specific scripts for that... (currently I invoke
> /bin/cat from R to create a concatenation of all those files).

I would use pipes, but a pure R solution is to process the files to an
anonymous file() connection and then read that.
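
A minimal sketch of the anonymous file() idea, for illustration only. The helper name read_concatenated and its arguments are assumptions, not something from this thread, and it assumes each input is a small text file that read.table() can parse once the leading lines are dropped.

```r
## Hypothetical helper (not from the thread): copy several files into one
## anonymous file() connection, optionally skipping the first lines of each,
## then read the combined text with read.table().
read_concatenated <- function(paths, skip = 0, ...) {
  con <- file()                      # anonymous file, opened in "w+" mode
  on.exit(close(con))                # close the connection when done
  for (p in paths) {
    lines <- readLines(p)
    if (skip > 0) lines <- lines[-seq_len(skip)]
    writeLines(lines, con)
  }
  read.table(con, ...)               # extra arguments go to read.table()
}

## Usage with two small temporary files, each starting with a header line.
paths <- c(tempfile(), tempfile())
writeLines(c("x y", "1 2", "3 4"), paths[1])
writeLines(c("x y", "5 6"), paths[2])
dat <- read_concatenated(paths, skip = 1, header = FALSE)
unlink(paths)
```

This stays within R and needs no external cat, at the cost of passing every line through R once before parsing.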

However, what is wrong with reading a file at a time and combining the 
results in R using rbind?

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: [R] Creating a custom connection to read from multiple files

2005-01-20 Thread Tomas Kalibera
Dear Prof Ripley,
thanks for your suggestions. It's very nice that one can create custom
connections directly in R, and I think that is what I need just now.

> However, what is wrong with reading a file at a time and combining the
> results in R using rbind?

Well, the problem is performance. Concatenated, all those files come to
around 8 MB, and they can grow to tens of MB in the near future.

Concatenating the files and then reading the single file with scan takes 5
seconds (which is almost OK).

However, reading the individual files with read.table and rbinding them one
by one (samples = rbind(samples, newSamples)) takes minutes. The same happens
when I concatenate lists manually. Using scan does not help significantly. I
guess there is some overhead in detecting the dimensions of objects in rbind,
or in re-allocating or copying data?

Best regards,
Tomas Kalibera


Re: [R] Creating a custom connection to read from multiple files

2005-01-20 Thread Prof Brian Ripley
On Thu, 20 Jan 2005, Tomas Kalibera wrote:

> Dear Prof Ripley,
>
> thanks for your suggestions, it's very nice one can create custom connections
> directly in R and I think it is what I need just now.
>
> > However, what is wrong with reading a file at a time and combining the
> > results in R using rbind?
>
> Well, the problem is performance. If I concatenate all those files, they have
> around 8MB, can grow to tens of MBs in near future.
>
> Both concatenating and reading from a single file by scan takes 5 seconds
> (which is almost OK).
>
> However, reading individual files by read.table and rbinding one by one (
> samples=rbind(samples, newSamples ) takes minutes. The same is when I
> concatenate lists manually. Scan does not help significantly. I guess there
> is some overhead in detecting dimensions of objects in rbind (?) or
> re-allocation or copying data ?

rbind is vectorized so you are using it (way) suboptimally.
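
To illustrate the point with a small sketch (the matrices here are made up): rbind() accepts any number of arguments, so a list of pieces can be stacked in one call via do.call() instead of growing the result inside a loop, which copies the accumulated rows on every iteration.

```r
## Made-up data: 20 small matrices with the same number of columns.
pieces <- lapply(1:20, function(i) matrix(i, nrow = 5, ncol = 3))

## Slow pattern: grow the result one piece at a time.
grown <- NULL
for (p in pieces) grown <- rbind(grown, p)

## Vectorized pattern: pass all pieces to a single rbind() call.
stacked <- do.call(rbind, pieces)

## Both results hold the same values in the same shape.
dim(stacked)
all(grown == stacked)
```

With only 20 tiny matrices the timing difference is negligible; it grows with the number and size of the pieces.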


RE: [R] Creating a custom connection to read from multiple files

2005-01-20 Thread Liaw, Andy
> From: Prof Brian Ripley
>
> On Thu, 20 Jan 2005, Tomas Kalibera wrote:
>
> > Dear Prof Ripley,
> >
> > thanks for your suggestions, it's very nice one can create custom connections
> > directly in R and I think it is what I need just now.
> >
> > > However, what is wrong with reading a file at a time and combining the
> > > results in R using rbind?
> >
> > Well, the problem is performance. If I concatenate all those files, they have
> > around 8MB, can grow to tens of MBs in near future.
> >
> > Both concatenating and reading from a single file by scan takes 5 seconds
> > (which is almost OK).
> >
> > However, reading individual files by read.table and rbinding one by one (
> > samples=rbind(samples, newSamples ) takes minutes. The same is when I
> > concatenate lists manually. Scan does not help significantly. I guess there
> > is some overhead in detecting dimensions of objects in rbind (?) or
> > re-allocation or copying data ?
>
> rbind is vectorized so you are using it (way) suboptimally.

Here's an example:

> ## Create a 500 x 100 data matrix.
> x <- matrix(rnorm(5e4), 500, 100)
> ## Generate 50 filenames.
> fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
> ## Write the data to files 50 times.
> for (f in fname) write(t(x), file=f, ncol=ncol(x))
>
> ## Read the files into a list of data frames.
> system.time(datList <- lapply(fname, read.table, header=FALSE),
+             gcFirst=TRUE)
[1] 11.91  0.05 12.33    NA    NA
> ## Specify colClasses to speed up.
> system.time(datList <- lapply(fname, read.table,
+                               colClasses=rep("numeric", 100)),
+             gcFirst=TRUE)
[1] 10.69  0.07 10.79    NA    NA
> ## Stack them together.
> system.time(dat <- do.call(rbind, datList), gcFirst=TRUE)
[1] 5.34 0.09 5.45   NA   NA
>
> ## Use matrices instead of data frames.
> system.time(datList <- lapply(fname,
+     function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
Read 50000 items
...
Read 50000 items
[1]  9.49  0.08 15.06    NA    NA
> system.time(dat <- do.call(rbind, datList), gcFirst=TRUE)
[1] 0.09 0.03 0.12   NA   NA
> ## Clean up the files.
> unlink(fname)

A couple of points:

- Usually specifying colClasses will make read.table() quite a bit 
  faster, even though it's only marginally faster here.  Look back
  in the list archive to see examples.

- If your data files are all numerics (as in this example), 
  storing them in matrices will be much more efficient.  Note
  the difference in rbind()ing the 50 data frames and 50 
  matrices (5.34 seconds vs. 0.09!).  rbind.data.frame()
  needs to ensure that the resulting data frame has unique
  rownames (a requirement for a legit data frame), and
  that's probably taking a big chunk of the time.

Andy

 

