Re: [R] read.table performance
> By the way, here's my original session information. (I can never
> remember the name of that command when I want it.) It's strange that
> Petr is having the problem with 2.14. It's relatively fast on my
> machine with R 2.14.

Probably I use an inferior PC by today's standards: Windows XP, Intel 2.33 GHz, 2 GB memory.

Petr
Re: [R] read.table performance
Hi,

> system.time(dat <- read.table("test2.txt"))
   user  system elapsed
  32.38    0.00   32.40
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  32.30    0.03   32.36

Couldn't it be a Windows issue?

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status         Under development (unstable)
major          2
minor          14.0
year           2011
month          04
day            27
svn rev        55657
language       R
version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)

> dim(dat)
[1]    7 3765

But from the dat file it seems to me that its structure is somehow weird.

> head(names(dat))
[1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
[6] "Carbon"
> tail(names(dat))
[1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
[6] "Scandium.32"

There is a row of names with repeating values. Maybe most of the time is spent checking the validity of the names.

Regards,
Petr
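[A quick way to test Petr's hypothesis, a sketch not taken from the thread, is to turn name checking off and compare timings; check.names=FALSE skips the make.names() pass over the header, and the call otherwise mirrors Gene's:

system.time(dat <- read.table("test2.txt", nrows = -1, sep = "\t",
                              header = TRUE, check.names = FALSE))

Note, though, that make.names accounts for only 0.02s in the profiles shown later in the thread, so this is unlikely to be the whole story.]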
Re: [R] read.table performance
On 08/12/11 09:32, Petr PIKAL wrote:

> Couldn't it be a Windows issue?

Likely - here on Linux I get:

> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed
  1.560   0.000   1.579

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          14.0
year           2011
month          10
day            31
svn rev        57496
language       R
version.string R version 2.14.0 (2011-10-31)

Cheers,
Rainer
Re: [R] read.table performance
By the way, here's my original session information. (I can never remember the name of that command when I want it.) It's strange that Petr is having the problem with 2.14; it's relatively fast on my machine with R 2.14.

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Thu, Dec 8, 2011 at 3:06 AM, Rainer M Krug r.m.k...@gmail.com wrote:

> Likely - here on Linux I get:
>
> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t', header=TRUE))
>    user  system elapsed
>   1.560   0.000   1.579
Re: [R] read.table performance
Now this is interesting. Here's how long it took to read the same file in various versions of R:

R version 2.10.1 (2009-12-14)     3.97
R version 2.12.0 (2010-10-15)    24.53
R version 2.13.0 (2011-04-13)    24.48
R version 2.14.0 (2011-10-31)     3.75

I think the even-numbered releases of R are generally faster with read.table (except 2.12), kind of like how Beethoven's odd-numbered symphonies are generally more popular (except the 6th).
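[Reproducing cross-version timings like these without Gene's file needs the kind of generator script Peter Dalgaard asks for later in the thread. A sketch under assumed content: only the dimensions come from the thread (dim(dat) of 7 x 3765, roughly 315 kb); the column names and values are invented:

set.seed(1)
ncols <- 3700
nrows <- 7
f <- tempfile(fileext = ".txt")
# tab-delimited header row plus a few rows of random numbers
header <- paste(paste("col", seq_len(ncols), sep = ""), collapse = "\t")
rows <- replicate(nrows, paste(signif(runif(ncols), 6), collapse = "\t"))
writeLines(c(header, rows), f)
system.time(dat <- read.table(f, sep = "\t", header = TRUE))
]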
Re: [R] read.table performance
Here is a test that I ran where the difference was whether the data was in a single column or in 3700 columns. In a single column, 'scan' and 'read.table' were comparable; with 3700 columns, read.table took 3x as long, and using 'colClasses' did not make a difference:

> x.n <- as.character(runif(3700))
> x.f <- tempfile()
> # just write out a file of numbers in a single column
> # 3700 * 500 = 1.85M lines
> writeLines(rep(x.n, 500), con = x.f)
> file.info(x.f)
                                                                     size isdir mode               mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 35154500 FALSE  666 2011-12-07 06:13:56
                                                                               ctime               atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 2011-12-07 06:13:52 2011-12-07 06:13:52
                                                                 exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064  no
> system.time(x.n.read <- scan(x.f))
Read 1850000 items
   user  system elapsed
   4.04    0.05    4.10
> dim(x.n.read)
NULL
> object.size(x.n.read)
14800040 bytes
> system.time(x.n.read <- read.table(x.f))  # comparable to 'scan'
   user  system elapsed
   4.68    0.06    4.74
> object.size(x.n.read)
14800672 bytes
> # now create data with 3700 columns and 500 rows (1.85M numbers)
> x.long <- paste(x.n, collapse = ',')
> writeLines(rep(x.long, 500), con = x.f)
> file.info(x.f)
                                                                     size isdir mode               mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 33305000 FALSE  666 2011-12-07 06:14:11
                                                                               ctime               atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 2011-12-07 06:13:52 2011-12-07 06:13:52
                                                                 exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064  no
> system.time(x.long.read <- scan(x.f, sep = ','))
Read 1850000 items
   user  system elapsed
   4.21    0.02    4.23
> dim(x.long.read)
NULL
> object.size(x.long.read)
14800040 bytes
> # takes 3 times as long as 'scan'
> system.time(x.long.read <- read.table(x.f, sep = ','))
   user  system elapsed
  13.24    0.06   13.33
> dim(x.long.read)
[1]  500 3700
> object.size(x.long.read)
15185368 bytes
> # using colClasses
> system.time(x.long.read <- read.table(x.f, sep = ','
+                 , colClasses = rep('numeric', 3700)
+                 )
+ )
   user  system elapsed
  12.39    0.06   12.48

On Tue, Dec 6, 2011 at 4:33 PM, Gene Leynes gley...@gmail.com wrote:

> Mark,
>
> Thanks for your suggestions. That's a good idea about the NULL
> columns; I didn't think of that. Surprisingly, it didn't have any
> effect on the time. [...]
Re: [R] read.table performance
R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file verbatim:

system.time(read.table("test2.txt"))

Michael

2011/12/7 Gene Leynes gley...@gmail.com:

> Peter,
>
> You're quite right; it's nearly impossible to make progress without a
> working example. I created an ** extremely simplified ** example for
> distribution. The real data has numeric, character, and boolean
> classes. The file still takes 25.08 seconds to read, despite its
> small size.
>
> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> machine (not that it should particularly matter with this type of
> data / functions).
>
> ## The code:
> options(stringsAsFactors=FALSE)
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> str(dat, 0)
>
> Thanks again!
Re: [R] read.table performance
On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:

> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file verbatim:
>
> system.time(read.table("test2.txt"))

About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.

Gene, are you by any chance storing the file in a heavily virus-scanned system directory?

-pd
Re: [R] read.table performance
No, it was just on my desktop (and on a network drive, and in a temp folder on my C drive). There have been some new policies put into place at work, though, and perhaps that includes some monitoring software, but I don't know.

Sent from my iPhone

On Dec 7, 2011, at 4:11 PM, peter dalgaard pda...@gmail.com wrote:

> Gene, are you by any chance storing the file in a heavily
> virus-scanned system directory?
>
> -pd
[R] read.table performance
** Disclaimer: I'm looking for general suggestions **

I'm sorry, but I can't send out the file I'm using, so there is no reproducible example.

I'm using read.table and it's taking over 30 seconds to read a tiny file. The strange thing is that it takes roughly the same amount of time if the file is 100 times larger. After re-reviewing the R Data Import/Export manual, I think the best approach would be to use Python, or perhaps the readLines function, but I was hoping to understand why the simple read.table approach wasn't working as expected.

Some relevant facts:

1. There are about 3700 columns. Maybe this is the problem? Still, the file size is not very large.
2. The file encoding is ANSI, but I'm not specifying that in the function. Setting fileEncoding="ANSI" produces an "unsupported conversion" error.
3. readLines imports the lines quickly.
4. scan imports the file quickly also.

Obviously, scan and readLines would require more coding to identify columns, etc.

My code:

system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t', header=TRUE))

It's taking 33.4 seconds and the file size is only 315 kb!

Thanks,
Gene
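[Since scan is fast on this file, one workaround is to read everything with scan and rebuild the data frame by hand. A minimal sketch, not code from the thread: the path and separator come from Gene's call, and everything else is a generic reconstruction:

f <- "C:/test.txt"
# read the header row and the body as plain character data
hdr  <- scan(f, what = character(), sep = "\t", nlines = 1, quiet = TRUE)
body <- scan(f, what = character(), sep = "\t", skip = 1, quiet = TRUE)
# reshape into a matrix, one row per input line
m <- matrix(body, ncol = length(hdr), byrow = TRUE,
            dimnames = list(NULL, hdr))
dat <- as.data.frame(m, stringsAsFactors = FALSE)
# let R guess column types, as read.table would have done
dat[] <- lapply(dat, type.convert, as.is = TRUE)
]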
Re: [R] read.table performance
On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes gley...@gmail.com wrote:

> I'm using read.table and it's taking over 30 seconds to read a tiny
> file. The strange thing is that it takes roughly the same amount of
> time if the file is 100 times larger. [...]

You could also try read.csv.sql in the sqldf package and see whether or not that is any faster. Be sure you are using RSQLite 0.11.0 (and not an earlier version) with that, since earlier versions were compiled to work with a maximum of only 999 columns.

library(sqldf)
DF <- read.csv.sql("C:\\test.txt", header = TRUE, sep = "\t")

You may or may not have to use the eol= argument to specify line endings. See ?read.csv.sql.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] read.table performance
Mark,

Thanks for your suggestions. That's a good idea about the NULL columns; I didn't think of that. Surprisingly, it didn't have any effect on the time.

This problem was just a curiosity; I already did the import using Excel and VBA. I was going to illustrate the power and simplicity of R, but ironically it's been much slower and harder in R. The VBA was painful and messy, and took me over an hour to write, but at least it worked quickly and reliably. The R code was clean and only took me about 5 minutes to write, but the run time was prohibitively slow!

I profiled the code, but that offers little insight to me.

Profile results with 10 line file:

> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             12.24    53.50      12.24     53.50
read.table       10.58    46.24      22.88    100.00
type.convert      0.04     0.17       0.04      0.17
make.names        0.02     0.09       0.02      0.09

$by.total
             total.time total.pct self.time self.pct
read.table        22.88    100.00     10.58     46.24
scan              12.24     53.50     12.24     53.50
type.convert       0.04      0.17      0.04      0.17
make.names         0.02      0.09      0.02      0.09

$sample.interval
[1] 0.02

$sampling.time
[1] 22.88

Profile results with 250 line file:

> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
             self.time self.pct total.time total.pct
scan             23.88    68.15      23.88     68.15
read.table       10.78    30.76      35.04    100.00
type.convert      0.30     0.86       0.32      0.91
character         0.02     0.06       0.02      0.06
file              0.02     0.06       0.02      0.06
lapply            0.02     0.06       0.02      0.06
unlist            0.02     0.06       0.02      0.06

$by.total
               total.time total.pct self.time self.pct
read.table          35.04    100.00     10.78     30.76
scan                23.88     68.15     23.88     68.15
type.convert         0.32      0.91      0.30      0.86
sapply               0.04      0.11      0.00      0.00
character            0.02      0.06      0.02      0.06
file                 0.02      0.06      0.02      0.06
lapply               0.02      0.06      0.02      0.06
unlist               0.02      0.06      0.02      0.06
simplify2array       0.02      0.06      0.00      0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 35.04

On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds marklee...@gmail.com wrote:

> hi gene: maybe someone else will reply with some subtleties that I'm
> not aware of. one other thing that might help: if you know which
> columns you want, you can set the others to NULL through colClasses
> and this should speed things up also. For example, say you knew you
> only wanted the first four columns and they were character. then you
> could do:
>
> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4), rep(NULL,3696)))
>
> hopefully someone else will say something that does the trick. it
> seems odd to me as far as the difference in timings. good luck.

On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes gley...@gmail.com wrote:

> Mark,
>
> Thank you for the reply. I neglected to mention that I had already
> set options(stringsAsFactors=FALSE). I agree, skipping the factor
> determination can help performance.
>
> The main reason that I wanted to use read.table is because it will
> correctly determine the column classes for me. I don't really want
> to specify 3700 column classes! (I'm not sure what they are anyway.)

On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds marklee...@gmail.com wrote:

> Hi Gene: Sometimes using colClasses in read.table can speed things
> up. If you know what your variables are ahead of time and what you
> want them to be, this allows you to be specific by specifying
> "character" or "numeric", etc., and often it makes things faster.
> Also, if most of your variables are characters, R will try to convert
> them into factors by default. If you use as.is = TRUE it won't do
> this, and that might speed things up also.
>
> Rejoinder: the above tidbits are just from experience. I don't know
> if it's set in stone or a hard and fast rule.
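[For reference, output like the above would come from wrapping the read in Rprof. A minimal sketch with shortened paths; the 0.02s interval matches $sample.interval in the results:

Rprof("test.out", interval = 0.02)             # start sampling profiler
dat <- read.table("test2.txt", sep = "\t", header = TRUE)
Rprof(NULL)                                    # stop profiling
summaryRprof("test.out")                       # tabulate self/total times
]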
Re: [R] read.table performance
On Dec 6, 2011, at 22:33 , Gene Leynes wrote:

> Mark,
>
> Thanks for your suggestions. That's a good idea about the NULL
> columns; I didn't think of that. Surprisingly, it didn't have any
> effect on the time.

Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you fix both?

read.table(whatever, as.is=TRUE, colClasses = c(rep("character",4), rep("NULL",3696)))

As a general matter, if you want people to dig into this, they need some paraphrase of the file to play with. Would it be possible to set up a small R program that generates a data file which displays the issue? Everything I try seems to take about a second to read in.

-pd
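[Spelling out Peter's point: unquoted, character and NULL are the R function and the NULL object, so colClasses needs the class names as strings, with "NULL" dropping a column entirely. A sketch; the file name and the 4 + 3696 split are taken from the thread:

colcl <- c(rep("character", 4), rep("NULL", 3696))  # "NULL" = skip column
dat <- read.table("test2.txt", sep = "\t", header = TRUE,
                  as.is = TRUE, colClasses = colcl)

With only 4 of the 3700 columns actually read, this should also shrink the resulting data frame considerably.]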