Re: [R] read.table performance
> By the way, here's my original session information. (I can never
> remember the name of that command when I want it.) It's strange that
> Petr is having the problem with 2.14. It's relatively fast on my
> machine with R 2.14.

Probably I use an inferior PC by today's standards: Windows XP, Intel 2.33 GHz, 2 GB memory.

Petr
Re: [R] read.table performance
Hi,

> system.time(dat <- read.table("test2.txt"))
   user  system elapsed
  32.38    0.00   32.40
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  32.30    0.03   32.36

Couldn't it be a Windows issue?

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status         Under development (unstable)
major          2
minor          14.0
year           2011
month          04
day            27
svn rev        55657
language       R
version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)

> dim(dat)
[1]    7 3765

But from the dat file it seems to me that its structure is somehow weird.

> head(names(dat))
[1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
[6] "Carbon"
> tail(names(dat))
[1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
[6] "Scandium.32"

There is a row of names with repeating values. Maybe most of the time is spent checking the validity of the names.

Regards,
Petr
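[A quick way to test Petr's hypothesis, a sketch not taken from the thread, is to turn name checking off and compare timings; check.names=FALSE skips the make.names() pass over the header, and the call otherwise mirrors Gene's:

system.time(dat <- read.table("test2.txt", nrows = -1, sep = "\t",
                              header = TRUE, check.names = FALSE))

Note, though, that make.names accounts for only 0.02s in the profiles shown later in the thread, so this is unlikely to be the whole story.]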
Re: [R] read.table performance
On 08/12/11 09:32, Petr PIKAL wrote:

> Couldn't it be a Windows issue?

Likely - here on Linux I get:

> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  1.560   0.000   1.579

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          14.0
year           2011
month          10
day            31
svn rev        57496
language       R
version.string R version 2.14.0 (2011-10-31)

Cheers,
Rainer
Re: [R] read.table performance
By the way, here's my original session information. (I can never remember the name of that command when I want it.) It's strange that Petr is having the problem with 2.14; it's relatively fast on my machine with R 2.14.

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Thu, Dec 8, 2011 at 3:06 AM, Rainer M Krug r.m.k...@gmail.com wrote:

> Likely - here on Linux I get:
>
> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t', header=TRUE))
>    user  system elapsed
>   1.560   0.000   1.579
Re: [R] read.table performance
Now this is interesting. Here's how long it took to read the same file in various versions of R:

R version 2.10.1 (2009-12-14)     3.97
R version 2.12.0 (2010-10-15)    24.53
R version 2.13.0 (2011-04-13)    24.48
R version 2.14.0 (2011-10-31)     3.75

I think the even-numbered releases of R are generally faster with read.table (except 2.12), kind of like how Beethoven's odd-numbered symphonies are generally more popular (except the 6th).
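[Reproducing cross-version timings like these without Gene's file needs the kind of generator script Peter Dalgaard asks for later in the thread. A sketch under assumed content: only the dimensions come from the thread (dim(dat) of 7 x 3765, roughly 315 kb); the column names and values are invented:

set.seed(1)
ncols <- 3700
nrows <- 7
f <- tempfile(fileext = ".txt")
# tab-delimited header row plus a few rows of random numbers
header <- paste(paste("col", seq_len(ncols), sep = ""), collapse = "\t")
rows <- replicate(nrows, paste(signif(runif(ncols), 6), collapse = "\t"))
writeLines(c(header, rows), f)
system.time(dat <- read.table(f, sep = "\t", header = TRUE))
]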
Re: [R] read.table performance
Here is a test that I ran where the difference was whether the data was in a single column or in 3700 columns. In a single column, 'scan' and 'read.table' were comparable; with 3700 columns, read.table took 3x as long, and using 'colClasses' did not make a difference:

> x.n <- as.character(runif(3700))
> x.f <- tempfile()
> # just write out a file of numbers in a single column
> # 3700 * 500 = 1.85M lines
> writeLines(rep(x.n, 500), con = x.f)
> file.info(x.f)
                                                                     size isdir mode               mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 35154500 FALSE  666 2011-12-07 06:13:56
                                                                               ctime               atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 2011-12-07 06:13:52 2011-12-07 06:13:52
                                                                 exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064  no
> system.time(x.n.read <- scan(x.f))
Read 1850000 items
   user  system elapsed
   4.04    0.05    4.10
> dim(x.n.read)
NULL
> object.size(x.n.read)
14800040 bytes
> system.time(x.n.read <- read.table(x.f))  # comparable to 'scan'
   user  system elapsed
   4.68    0.06    4.74
> object.size(x.n.read)
14800672 bytes
> # now create data with 3700 columns and 500 rows (1.85M numbers)
> x.long <- paste(x.n, collapse = ',')
> writeLines(rep(x.long, 500), con = x.f)
> file.info(x.f)
                                                                     size isdir mode               mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 33305000 FALSE  666 2011-12-07 06:14:11
                                                                               ctime               atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 2011-12-07 06:13:52 2011-12-07 06:13:52
                                                                 exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064  no
> system.time(x.long.read <- scan(x.f, sep = ','))
Read 1850000 items
   user  system elapsed
   4.21    0.02    4.23
> dim(x.long.read)
NULL
> object.size(x.long.read)
14800040 bytes
> # takes 3 times as long as 'scan'
> system.time(x.long.read <- read.table(x.f, sep = ','))
   user  system elapsed
  13.24    0.06   13.33
> dim(x.long.read)
[1]  500 3700
> object.size(x.long.read)
15185368 bytes
> # using colClasses
> system.time(x.long.read <- read.table(x.f, sep = ','
+                 , colClasses = rep('numeric', 3700)
+                 )
+ )
   user  system elapsed
  12.39    0.06   12.48

On Tue, Dec 6, 2011 at 4:33 PM, Gene Leynes gley...@gmail.com wrote:

> Mark,
>
> Thanks for your suggestions. That's a good idea about the NULL
> columns; I didn't think of that. Surprisingly, it didn't have any
> effect on the time. [...]
Re: [R] read.table performance
R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file verbatim:

system.time(read.table("test2.txt"))

Michael

2011/12/7 Gene Leynes gley...@gmail.com:

> Peter,
>
> You're quite right; it's nearly impossible to make progress without a
> working example. I created an ** extremely simplified ** example for
> distribution. The real data has numeric, character, and boolean
> classes. The file still takes 25.08 seconds to read, despite its
> small size.
>
> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> machine (not that it should particularly matter with this type of
> data / functions).
>
> ## The code:
> options(stringsAsFactors=FALSE)
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> str(dat, 0)
>
> Thanks again!
Re: [R] read.table performance
On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:

> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file verbatim:
>
> system.time(read.table("test2.txt"))

About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.

Gene, are you by any chance storing the file in a heavily virus-scanned system directory?

-pd
Re: [R] read.table performance
No, it was just on my desktop (and on a network drive, and in a temp folder on my C drive). There have been some new policies put into place at work, though, and perhaps that includes some monitoring software, but I don't know.

Sent from my iPhone

On Dec 7, 2011, at 4:11 PM, peter dalgaard pda...@gmail.com wrote:

> Gene, are you by any chance storing the file in a heavily
> virus-scanned system directory?
>
> -pd
[R] read.table performance
** Disclaimer: I'm looking for general suggestions **

I'm sorry, but I can't send out the file I'm using, so there is no reproducible example.

I'm using read.table and it's taking over 30 seconds to read a tiny file. The strange thing is that it takes roughly the same amount of time if the file is 100 times larger. After re-reviewing the R Data Import/Export manual, I think the best approach would be to use Python, or perhaps the readLines function, but I was hoping to understand why the simple read.table approach wasn't working as expected.

Some relevant facts:

1. There are about 3700 columns. Maybe this is the problem? Still, the file size is not very large.
2. The file encoding is ANSI, but I'm not specifying that in the function. Setting fileEncoding="ANSI" produces an "unsupported conversion" error.
3. readLines imports the lines quickly.
4. scan imports the file quickly also.

Obviously, scan and readLines would require more coding to identify columns, etc.

My code:

system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t', header=TRUE))

It's taking 33.4 seconds and the file size is only 315 kb!

Thanks,
Gene
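[Since scan is fast on this file, one workaround is to read everything with scan and rebuild the data frame by hand. A minimal sketch, not code from the thread: the path and separator come from Gene's call, and everything else is a generic reconstruction:

f <- "C:/test.txt"
# read the header row and the body as plain character data
hdr  <- scan(f, what = character(), sep = "\t", nlines = 1, quiet = TRUE)
body <- scan(f, what = character(), sep = "\t", skip = 1, quiet = TRUE)
# reshape into a matrix, one row per input line
m <- matrix(body, ncol = length(hdr), byrow = TRUE,
            dimnames = list(NULL, hdr))
dat <- as.data.frame(m, stringsAsFactors = FALSE)
# let R guess column types, as read.table would have done
dat[] <- lapply(dat, type.convert, as.is = TRUE)
]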
Re: [R] read.table performance
On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes gley...@gmail.com wrote:

> I'm using read.table and it's taking over 30 seconds to read a tiny
> file. The strange thing is that it takes roughly the same amount of
> time if the file is 100 times larger. [...]

You could also try read.csv.sql in the sqldf package and see whether or not that is any faster. Be sure you are using RSQLite 0.11.0 (and not an earlier version) with that, since earlier versions were compiled to work with a maximum of only 999 columns.

library(sqldf)
DF <- read.csv.sql("C:\\test.txt", header = TRUE, sep = "\t")

You may or may not have to use the eol= argument to specify line endings. See ?read.csv.sql.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] read.table performance
Mark,

Thanks for your suggestions. That's a good idea about the NULL columns; I didn't think of that. Surprisingly, it didn't have any effect on the time.

This problem was just a curiosity; I already did the import using Excel and VBA. I was going to illustrate the power and simplicity of R, but ironically it's been much slower and harder in R. The VBA was painful and messy, and took me over an hour to write, but at least it worked quickly and reliably. The R code was clean and only took me about 5 minutes to write, but the run time was prohibitively slow!

I profiled the code, but that offers little insight to me.

Profile results with 10 line file:

> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             12.24    53.50      12.24     53.50
read.table       10.58    46.24      22.88    100.00
type.convert      0.04     0.17       0.04      0.17
make.names        0.02     0.09       0.02      0.09

$by.total
             total.time total.pct self.time self.pct
read.table        22.88    100.00     10.58     46.24
scan              12.24     53.50     12.24     53.50
type.convert       0.04      0.17      0.04      0.17
make.names         0.02      0.09      0.02      0.09

$sample.interval
[1] 0.02

$sampling.time
[1] 22.88

Profile results with 250 line file:

> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             23.88    68.15      23.88     68.15
read.table       10.78    30.76      35.04    100.00
type.convert      0.30     0.86       0.32      0.91
character         0.02     0.06       0.02      0.06
file              0.02     0.06       0.02      0.06
lapply            0.02     0.06       0.02      0.06
unlist            0.02     0.06       0.02      0.06

$by.total
               total.time total.pct self.time self.pct
read.table          35.04    100.00     10.78     30.76
scan                23.88     68.15     23.88     68.15
type.convert         0.32      0.91      0.30      0.86
sapply               0.04      0.11      0.00      0.00
character            0.02      0.06      0.02      0.06
file                 0.02      0.06      0.02      0.06
lapply               0.02      0.06      0.02      0.06
unlist               0.02      0.06      0.02      0.06
simplify2array       0.02      0.06      0.00      0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 35.04

On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds marklee...@gmail.com wrote:

> hi gene: maybe someone else will reply with some subtleties that I'm
> not aware of. one other thing that might help: if you know which
> columns you want, you can set the others to NULL through colClasses
> and this should speed things up also. For example, say you knew you
> only wanted the first four columns and they were character. then you
> could do:
>
> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4), rep(NULL,3696)))
>
> hopefully someone else will say something that does the trick. it
> seems odd to me as far as the difference in timings. good luck.

On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes gley...@gmail.com wrote:

> Mark,
>
> Thank you for the reply. I neglected to mention that I had already
> set options(stringsAsFactors=FALSE). I agree, skipping the factor
> determination can help performance.
>
> The main reason that I wanted to use read.table is because it will
> correctly determine the column classes for me. I don't really want
> to specify 3700 column classes! (I'm not sure what they are anyway.)

On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds marklee...@gmail.com wrote:

> Hi Gene: Sometimes using colClasses in read.table can speed things
> up. If you know what your variables are ahead of time and what you
> want them to be, this allows you to be specific by specifying
> "character" or "numeric", etc., and often it makes things faster.
> Also, if most of your variables are characters, R will try to convert
> them into factors by default. If you use as.is = TRUE it won't do
> this, and that might speed things up also.
>
> Rejoinder: the above tidbits are just from experience. I don't know
> if it's set in stone or a hard and fast rule.
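[For reference, output like the above would come from wrapping the read in Rprof. A minimal sketch with shortened paths; the 0.02s interval matches $sample.interval in the results:

Rprof("test.out", interval = 0.02)             # start sampling profiler
dat <- read.table("test2.txt", sep = "\t", header = TRUE)
Rprof(NULL)                                    # stop profiling
summaryRprof("test.out")                       # tabulate self/total times
]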
Re: [R] read.table performance
On Dec 6, 2011, at 22:33 , Gene Leynes wrote:

> Mark,
>
> Thanks for your suggestions. That's a good idea about the NULL
> columns; I didn't think of that. Surprisingly, it didn't have any
> effect on the time.

Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you fix both?

read.table(whatever, as.is=TRUE, colClasses = c(rep("character",4), rep("NULL",3696)))

As a general matter, if you want people to dig into this, they need some paraphrase of the file to play with. Would it be possible to set up a small R program that generates a data file which displays the issue? Everything I try seems to take about a second to read in.

-pd
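[Spelling out Peter's point: unquoted, character and NULL are the R function and the NULL object, so colClasses needs the class names as strings, with "NULL" dropping a column entirely. A sketch; the file name and the 4 + 3696 split are taken from the thread:

colcl <- c(rep("character", 4), rep("NULL", 3696))  # "NULL" = skip column
dat <- read.table("test2.txt", sep = "\t", header = TRUE,
                  as.is = TRUE, colClasses = colcl)

With only 4 of the 3700 columns actually read, this should also shrink the resulting data frame considerably.]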