[Rd] S3 best practice
Hello everyone,

Suppose I have an S3 class "dog" and a function plot.dog() which looks like this:

plot.dog <- function(x, show.uncertainty, ...){
    ## do a simple plot
    if (show.uncertainty){
        ## perform complicated combinatorial stuff that takes 20 minutes
        ## and superimpose the results on the simple plot
    }
}

I think that it would be better to somehow precalculate the uncertainty stuff and plot it separately. How best to do this in the context of an S3 method for plot()? What is best practice here?

--
Robin Hankin
Uncertainty Analyst
National Oceanography Centre, Southampton
European Way, Southampton SO14 3ZH, UK
tel 023-8059-7743

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Install.packages() bug in Windows XP (PR#9540)
Dear Users,

I run R 2.2.0 for Windows (OS: Windows XP). When calling the install.packages() function, after selecting the mirror, it shows:

--- Please select a CRAN mirror for use in this session ---
Aviso: unable to access index for repository http://cran.br.r-project.org/bin/windows/contrib/2.2
Aviso: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.2
Erro em install.packages() : argumento pkgs ausente, sem padrão
[i.e. Error in install.packages() : argument "pkgs" is missing, with no default]

Thanks,

Engº Juan S. Ramseyer.
Re: [Rd] Install.packages() bug in Windows XP (PR#9540)
This is not a bug. And even if it were a bug, you are asked to report bugs only against recent versions of R! Please ask questions on R-help!

1. Please check your firewall and proxy settings.
2. Please upgrade to a recent version of R.

Uwe Ligges

[EMAIL PROTECTED] wrote:
> Dear Users,
> I run R 2.2.0 for Windows (OS: Windows XP). When calling the install.packages() function, after selecting the mirror, it shows:
[...]
Re: [Rd] S3 best practice
Robin Hankin [EMAIL PROTECTED] writes:

> Suppose I have an S3 class "dog" and a function plot.dog() which looks like this:
>
> plot.dog <- function(x, show.uncertainty, ...){ do a simple plot; if (show.uncertainty){ perform complicated combinatorial stuff that takes 20 minutes and superimpose the results on the simple plot } }

How uncertain is the dog in the window?

> I think that it would be better to somehow precalculate the uncertainty stuff and plot it separately. How best to do this in the context of an S3 method for plot()?

Doing long computations within plot functions can be annoying, because often one needs to tweak the visual style of a plot and this requires numerous round trips. So I like your idea of precomputing the uncertainty stuff. uncertainty.dog() could return data that could then optionally be passed into the plot method. Another possibility is that the dog class could store the uncertainty data, and then the plot method would plot it if it is there (and/or if an option to plot it is given). In this case, I guess it would be:

x <- addUncertainty(x)

+ seth
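A minimal sketch of the second approach Seth describes — cache the expensive result on the object, and have the plot method use it when present. The field names ("weight", "uncertainty") and the placeholder computation are illustrative assumptions, not from the original post:

```r
## constructor for a toy "dog" object (field names are made up)
dog <- function(weight) structure(list(weight = weight), class = "dog")

## stand-in for the 20-minute combinatorial computation:
## run it once, cache the result on the object
addUncertainty <- function(x) {
  x$uncertainty <- 0.1 * x$weight   # placeholder for the real calculation
  x
}

plot.dog <- function(x, ...) {
  plot(x$weight, type = "b", ylab = "weight", ...)
  ## superimpose uncertainty only if the expensive part was precomputed
  if (!is.null(x$uncertainty)) {
    i <- seq_along(x$weight)
    segments(i, x$weight - x$uncertainty, i, x$weight + x$uncertainty)
  }
  invisible(x)
}

d <- dog(c(10, 12, 9, 11))
plot(d)                    # quick plot, no uncertainty
d <- addUncertainty(d)     # slow step, done once
plot(d)                    # same plot with uncertainty bars superimposed
```

This keeps the slow computation out of the plot/tweak cycle: plot(d) can be re-run with different graphical parameters without redoing the 20-minute calculation.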
[Rd] Wishlist: Make screeplot() a generic (PR#9541)
Full_Name: Gavin Simpson
Version: 2.5.0
OS: Linux (FC5)
Submission from: (NULL) (128.40.33.76)

Screeplots are a common plot type used to interpret the results of various ordination methods and other techniques. A number of packages include ordination techniques not included in a standard R installation. screeplot() works for princomp and prcomp objects, but not for these other techniques, as it was not designed to do so. The current situation means, for example, that I have called a function Screeplot() in one of my packages, but it would be easier for users if they only had to remember to use screeplot() to generate a screeplot.

I would like to request that screeplot() be made generic and that methods for prcomp and princomp be added to R-devel. This way, package authors can provide screeplot methods for their functions as appropriate.

I have taken a look at the sources for R-devel (from the SVN repository), in the files princomp-add.R and prcomp.R, and it looks like a relatively simple change to make screeplot() generic. I would be happy to provide patches and documentation if R Core were interested in making this change - I haven't done this yet as I don't want to spend time doing something that might not be acceptable to R Core in general.

Many thanks,
G
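The requested change amounts to the usual S3 pattern; a sketch under the assumption that a default method only needs an `sdev` component (which both prcomp and princomp objects have) — the eventual implementation in R may differ in argument names and defaults:

```r
screeplot <- function(x, ...) UseMethod("screeplot")

## default method: any object with an 'sdev' component,
## which covers both prcomp and princomp results
screeplot.default <- function(x, npcs = min(10, length(x$sdev)),
                              main = deparse(substitute(x)), ...) {
  barplot(x$sdev[seq_len(npcs)]^2, names.arg = seq_len(npcs),
          main = main, ylab = "Variances", ...)
}

## package authors can then register their own methods, e.g.
## screeplot.rda(), and users only have to remember screeplot()
p <- prcomp(USArrests, scale. = TRUE)
screeplot(p)
```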
[Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi,

I have a big data frame:

mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

for (key in row.names(dat)) {
    row <- dat[key, ]
    ... do some computation on row ...
}

which could probably be considered a very natural (and R'ish) way of doing it (but maybe I'm wrong and the real idiom for doing this is something different). The problem with this idiomatic form is that it is _very_ slow. The loop itself plus the simple extraction of the rows (no computation on the rows) takes 10 hours on a powerful server (quad core Linux with 8G of RAM)! Looping over the first 100 rows takes 12 seconds:

> system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
   user  system elapsed
 12.637   0.120  12.756

But if, instead of the above, I do this:

for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

> system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
   user  system elapsed
  0.576   0.096   0.673

I hope you will agree that this second form is much less natural. So I was wondering why the idiomatic form is so slow? Shouldn't the idiomatic form be not only elegant and easy to read, but also efficient?

Thanks,
H.

> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
attached base packages:
[1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
[7] "base"
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Extracting rows from data frames is tricky, since each of the columns could be of a different class. For your toy example, it seems a matrix would be a more reasonable option.

R-devel has some improvements to row extraction, if I remember correctly. You might want to try your example there.

-roger

Herve Pages wrote:
> Hi, I have a big data frame and I need to do some computation on each row. Currently I'm doing this:
> for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row ... }
[...]
--
Roger D. Peng | http://www.biostat.jhsph.edu/~rpeng/
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Your 2 examples have 2 differences, and they are therefore confounded in their effects. What are your results for:

system.time(for (i in 1:100) { row <- dat[i, ] })

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Herve Pages
Sent: Friday, March 02, 2007 11:40 AM
To: r-devel@r-project.org
Subject: [Rd] extracting rows from a data frame by looping over the row names: performance issues

Hi, I have a big data frame and I need to do some computation on each row.
[...]
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi Hervé,

depending on your problem, using mapply might help, as in the code example below:

a = data.frame(matrix(1:3e4, ncol=3))

print(system.time({
    r1 = numeric(nrow(a))
    for(i in seq_len(nrow(a))) {
        g = a[i,]
        r1[i] = mean(c(g$X1, g$X2, g$X3))
    }}))

print(system.time({
    f = function(X1,X2,X3) mean(c(X1, X2, X3))
    r2 = do.call(mapply, args=append(f, a))
}))

print(identical(r1, r2))

#   user  system elapsed
#  6.049   0.200   6.987
#   user  system elapsed
#  0.508   0.000   0.509
# [1] TRUE

Best wishes
Wolfgang

Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns could be of a different class. For your toy example, it seems a matrix would be a more reasonable option.
[...]
--
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
[Rd] Patch for format.pval limitation in format.R
'format.pval' has a major limitation in its implementation. For example, suppose a person had a vector like 'a' below, with the error being ±0.001:

a <- c(0.1, 0.3, 0.4, 0.5, 0.3, 0.0001)
format.pval(a, eps=0.001)

The person wants to have the 'format.pval' output with 2 digits always showing, like this:

[1] "0.10"   "0.30"   "0.40"   "0.50"   "0.30"   "<0.001"

However, format.pval can only display this:

[1] "0.1"    "0.3"    "0.4"    "0.5"    "0.3"    "<0.001"

If this were the 'format' function, this could be corrected by setting the 'nsmall' argument to 2. But 'format.pval' has no ability to pass arguments on to 'format'. I think that the best solution would be to give 'format.pval' a '...' argument that gets passed to all the 'format' calls inside 'format.pval'. I have attached a patch that does this. The patch is against the svn r-release branch, but it also works with r-devel.

Charles Dupont

--
Charles Dupont
Computer System Analyst
School of Medicine
Department of Biostatistics
Vanderbilt University

Index: src/library/base/R/format.R
===================================================================
--- src/library/base/R/format.R	(revision 40768)
+++ src/library/base/R/format.R	(working copy)
@@ -43,7 +43,7 @@
 }
 
 format.pval <- function(pv, digits = max(1, getOption("digits")-2),
-                        eps = .Machine$double.eps, na.form = "NA")
+                        eps = .Machine$double.eps, na.form = "NA", ...)
 {
     ## Format P values; auxiliary for print.summary.[g]lm(.)
@@ -55,8 +55,8 @@
 	## be smart -- differ for fixp. and expon. display:
 	expo <- floor(log10(ifelse(pv > 0, pv, 1e-50)))
 	fixp <- expo >= -3 | (expo == -4 & digits > 1)
-	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits)
-	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits)
+	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits, ...)
+	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits, ...)
 	r[!is0] <- rr
     }
     if(any(is0)) {
@@ -67,7 +67,7 @@
 	    digits <- max(1, nc - 7)
 	    sep <- if(digits==1 && nc <= 6) "" else " "
 	} else sep <- if(digits==1) "" else " "
-	r[is0] <- paste("<", format(eps, digits=digits), sep = sep)
+	r[is0] <- paste("<", format(eps, digits=digits, ...), sep = sep)
     }
     if(has.na) { ## rarely
 	rok <- r
Index: src/library/base/man/format.pval.Rd
===================================================================
--- src/library/base/man/format.pval.Rd	(revision 40768)
+++ src/library/base/man/format.pval.Rd	(working copy)
@@ -6,13 +6,14 @@
 \alias{format.pval}
 \usage{
 format.pval(pv, digits = max(1, getOption("digits") - 2),
-            eps = .Machine$double.eps, na.form = "NA")
+            eps = .Machine$double.eps, na.form = "NA", \dots)
 }
 \arguments{
   \item{pv}{a numeric vector.}
   \item{digits}{how many significant digits are to be used.}
   \item{eps}{a numerical tolerance: see Details.}
   \item{na.form}{character representation of \code{NA}s.}
+  \item{\dots}{arguments passed to the \code{\link{format}} function.}
 }
 \value{
   A character vector.
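With a patch like the above applied, the motivating example works by forwarding 'nsmall' through the new '...' argument (a sketch of the intended behaviour; an unpatched format.pval would not accept the extra argument):

```r
## p-values known to within ±0.001
a <- c(0.1, 0.3, 0.4, 0.5, 0.3, 0.0001)

## without extra arguments: trailing zeros are dropped
format.pval(a, eps = 0.001)
# e.g. "0.1" "0.3" "0.4" "0.5" "0.3" "<0.001"

## with the patch, 'nsmall' is forwarded to format(), so every
## fixed-notation value shows at least 2 decimal places
format.pval(a, eps = 0.001, nsmall = 2)
# e.g. "0.10" "0.30" "0.40" "0.50" "0.30" "<0.001"
```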
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns could be of a different class. For your toy example, it seems a matrix would be a more reasonable option.

There is no doubt about this ;-)

mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)

With the matrix:

> system.time(for (i in 1:100) { row <- mat[i, ] })
   user  system elapsed
      0       0       0

With the data frame:

> system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
   user  system elapsed
 12.565   0.296  12.859

And even with a mixed-type data frame, it's very tempting to convert it to a matrix before doing any loop on it:

dat2 <- as.data.frame(mat, stringsAsFactors=FALSE)
dat2 <- cbind(dat2, ii=1:300000)

> sapply(dat2, typeof)
         V1          V2          V3          V4          V5          ii
"character" "character" "character" "character" "character"   "integer"

> system.time(for (key in row.names(dat2)[1:100]) { row <- dat2[key, ] })
   user  system elapsed
 13.201   0.144  13.360
> system.time({mat2 <- as.matrix(dat2); for (i in 1:100) { row <- mat2[i, ] }})
   user  system elapsed
  0.128   0.036   0.163

Big win, isn't it? (only if you have enough memory for it, though...)

Cheers,
H.

> R-devel has some improvements to row extraction, if I remember correctly. You might want to try your example there.
[...]
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Here is an even faster one; the general point is to create a properly vectorized custom function/expression:

mymean <- function(x, y, z) (x+y+z)/3

a = data.frame(matrix(1:3e4, ncol=3))
attach(a)
print(system.time({r3 = mymean(X1, X2, X3)}))
detach(a)
# Yields:
# [1] 0.000 0.010 0.005 0.000 0.000

print(identical(r2, r3))
# [1] TRUE

# My values for versions 1 and 2 resp. were
# time for r1: [1] 29.420 23.090 60.093 0.000 0.000
# time for r2: [1]  1.400  0.050  1.505 0.000 0.000

Best wishes
Ulf

P.S. A somewhat more meaningful comparison of versions 2 and 3:

a = data.frame(matrix(1:3e5, ncol=3))
# time r2e5: [1] 12.040 0.150 12.920 0.000 0.000
# time r3e5: [1]  0.030 0.020  0.051 0.000 0.000

Wolfgang Huber wrote:
> depending on your problem, using mapply might help, as in the code example below:
[...]
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Ulf Martin wrote:
> Here is an even faster one; the general point is to create a properly vectorized custom function/expression:
>
> mymean <- function(x, y, z) (x+y+z)/3
> a = data.frame(matrix(1:3e4, ncol=3))
> attach(a)
> print(system.time({r3 = mymean(X1, X2, X3)}))
> detach(a)
> # Yields:
> # [1] 0.000 0.010 0.005 0.000 0.000

Very fast indeed! And you don't need the attach/detach trick to make your point, since it is (almost) as fast without it:

a = data.frame(matrix(1:3e4, ncol=3))
print(system.time({r3 = mymean(a$X1, a$X2, a$X3)}))

However, you are lucky here because in this example (the mean example) you can use vectorized arithmetic, which is of course very fast. What about the general case? Unfortunately, situations where you can properly vectorize tend to be much more frequent in tutorials and demos than in the real world. Maybe the mean example is a little bit too specific to answer the general question of what's the best way to _efficiently_ step through a data frame row by row.

Cheers,
H.

[...]
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Hi Greg,

Greg Snow wrote:
> Your 2 examples have 2 differences and they are therefore confounded in their effects. What are your results for:
>
> system.time(for (i in 1:100) { row <- dat[i, ] })

Right. What you suggest is even faster (and simpler):

mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)

> system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
   user  system elapsed
 13.241   0.460  13.702
> system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
   user  system elapsed
  0.280   0.372   0.650
> system.time(for (i in 1:100) { row <- dat[i, ] })
   user  system elapsed
  0.044   0.088   0.130

So apparently here extracting with dat[i, ] is 300 times faster than extracting with dat[key, ]!

> system.time(for (i in 1:100) dat["1", ])
   user  system elapsed
 12.680   0.396  13.075
> system.time(for (i in 1:100) dat[1, ])
   user  system elapsed
  0.060   0.076   0.137

Good to know! Thanks a lot,
H.
Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues
Herve Pages [EMAIL PROTECTED] writes:

> So apparently here extracting with dat[i, ] is 300 times faster than extracting with dat[key, ]!
>
> Good to know!

I think what you are seeing here has to do with the space-efficient storage of the row.names of a data.frame. The example data you are working with has no specified row names, and so they get stored in a compact fashion:

mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)
> typeof(attr(dat, "row.names"))
[1] "integer"

In the call to [.data.frame, when i is character the appropriate index is found using pmatch(), and this requires that the row names be converted to character. So in a loop, you get to convert the integer vector to a character vector at each iteration. If you assign character row names, things will be a bit faster:

# before
> system.time(for (i in 1:25) dat["2", ])
   user  system elapsed
  9.337   0.404  10.731

# this looks funny, but has the desired result
rownames(dat) <- rownames(dat)
typeof(attr(dat, "row.names"))

# after
> system.time(for (i in 1:25) dat["2", ])
   user  system elapsed
  0.343   0.226   0.608

And you probably would have seen this if you had looked at the profiling data:

Rprof()
for (i in 1:25) dat["2", ]
Rprof(NULL)
summaryRprof()

+ seth