[Rd] S3 best practice

2007-03-02 Thread Robin Hankin
Hello everyone

Suppose I have an S3 class dog and a function plot.dog() which
looks like this:

plot.dog <- function(x, show.uncertainty, ...) {
    ## do a simple plot
    if (show.uncertainty) {
        ## perform complicated combinatorial stuff that takes 20 minutes
        ## and superimpose the results on the simple plot
    }
}


I think that it would be better to somehow precalculate the
uncertainty stuff and plot it separately.

How best to do this
in the context of an S3 method for plot()?

What is Best Practice here?



--
Robin Hankin
Uncertainty Analyst
National Oceanography Centre, Southampton
European Way, Southampton SO14 3ZH, UK
  tel  023-8059-7743

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Install.packages() bug in Windows XP (PR#9540)

2007-03-02 Thread juan
Dear Users,

I run R 2.2.0 for Windows (OS: Windows XP). When I call the
install.packages() function, after selecting the mirror, it shows:

--- Please select a CRAN mirror for use in this session ---
Warning: unable to access index for repository
http://cran.br.r-project.org/bin/windows/contrib/2.2
Warning: unable to access index for repository
http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.2
Error in install.packages() : argument 'pkgs' is missing, with no default

Thanks,

Engº Juan S. Ramseyer.




Re: [Rd] Install.packages() bug in Windows XP (PR#9540)

2007-03-02 Thread Uwe Ligges
This is not a bug.
And even if it were a bug, you are asked to report bugs only against recent versions of R!
Please ask questions on R-help!

1. Please check your firewall and proxy settings.
2. Please upgrade to a recent version of R.

Uwe Ligges





Re: [Rd] S3 best practice

2007-03-02 Thread Seth Falcon
Robin Hankin [EMAIL PROTECTED] writes:

> Hello everyone
>
> Suppose I have an S3 class dog and a function plot.dog() which
> looks like this:
>
> plot.dog <- function(x, show.uncertainty, ...) {
>     ## do a simple plot
>     if (show.uncertainty) {
>         ## perform complicated combinatorial stuff that takes 20 minutes
>         ## and superimpose the results on the simple plot
>     }
> }

How uncertain is the dog in the window?

> I think that it would be better to somehow precalculate the
> uncertainty stuff and plot it separately.
>
> How best to do this
> in the context of an S3 method for plot()?

Doing long computations within plot functions can be annoying because
often one needs to tweak the visual style of a plot and this
requires numerous round trips.  So I like your idea of precomputing
the uncertainty stuff.

uncertainty.dog could return data that could then optionally be passed
into the plot method.  Another possibility is that the dog class
could store the uncertainty data, and then the plot method would plot
it if it is there (and/or if an option to plot it is given).  In this
case, I guess it would be:  x <- addUncertainty(x)
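A minimal sketch of that second pattern (the names `dog` and `addUncertainty`
and the list layout of the object are hypothetical illustrations, not an
existing API):

```r
## Store the precomputed uncertainty on the object; plot.dog() then
## superimposes it only when it is present.
dog <- function(x) structure(list(x = x), class = "dog")

addUncertainty <- function(d) {
  ## stand-in for the 20-minute combinatorial computation
  d$uncertainty <- range(d$x)
  d
}

plot.dog <- function(x, ...) {
  plot(x$x, ...)                     # the simple plot
  if (!is.null(x$uncertainty))       # superimpose only if precomputed
    abline(h = x$uncertainty, lty = 2)
  invisible(x)
}

d <- dog(rnorm(10))
d <- addUncertainty(d)   # the slow step, done once
## plot(d)               # every subsequent plot reuses the stored result
```

This keeps the slow computation out of the plot method entirely, so tweaking
the visual style afterwards is cheap.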

+ seth



[Rd] Wishlist: Make screeplot() a generic (PR#9541)

2007-03-02 Thread gavin . simpson
Full_Name: Gavin Simpson
Version: 2.5.0
OS: Linux (FC5)
Submission from: (NULL) (128.40.33.76)


Screeplots are a common plot-type used to interpret the results of various
ordination methods and other techniques. A number of packages include ordination
techniques not included in a standard R installation. screeplot() works for
princomp and prcomp objects, but not for these other techniques as it was not
designed to do so. The current situation means, for example, that I have called
a function Screeplot() in one of my packages, but it would be easier for users
if they only had to remember to use screeplot() to generate a screeplot.

I would like to request that screeplot be made generic and methods for prcomp
and princomp added to R devel. This way, package authors can provide screeplot
methods for their functions as appropriate.

I have taken a look at the sources for R devel (from the SVN repository), in the
files princomp-add.R and prcomp.R, and it looks like a relatively simple change
to make screeplot generic.
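The change requested here is the standard S3 promotion pattern; a minimal
sketch (the name `screeplot2` and the method body are illustrative, chosen so
as not to mask any real `stats` function):

```r
## Promote a plain plotting function to an S3 generic, so that package
## authors can add methods for their own ordination classes.
screeplot2 <- function(x, ...) UseMethod("screeplot2")

## Default method: works for any object carrying an sdev component,
## as princomp and prcomp results do.
screeplot2.default <- function(x, npcs = min(10, length(x$sdev)),
                               type = c("barplot", "lines"), ...) {
  type <- match.arg(type)
  vars <- x$sdev[seq_len(npcs)]^2
  if (type == "barplot") barplot(vars, ylab = "Variances", ...)
  else plot(seq_len(npcs), vars, type = "b", ylab = "Variances", ...)
  invisible(vars)
}
```

A package could then provide, say, `screeplot2.myOrdination`, and users would
only ever have to remember the generic's name.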

I would be happy to provide patches and documentation if R Core were interested
in making this change - I haven't done this yet as I don't want to spend time
doing something that might not be acceptable to R core in general.

Many thanks,

G



[Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Hi,


I have a big data frame:

   mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
   dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

   for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }

which could probably be considered a very natural (and R'ish) way of doing it
(but maybe I'm wrong and the real idiom for doing this is something different).

The problem with this idiomatic form is that it is _very_ slow. The loop
itself + the simple extraction of the rows (no computation on the rows) takes
10 hours on a powerful server (quad core Linux with 8G of RAM)!

Looping over the first 100 rows takes 12 seconds:

   system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   12.637   0.120  12.756

But if, instead of the above, I do this:

   for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

   system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
 user  system elapsed
0.576   0.096   0.673

I hope you will agree that this second form is much less natural.

So I was wondering why the idiomatic form is so slow? Shouldn't the idiomatic
form be, not only elegant and easy to read, but also efficient?


Thanks,
H.


> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods
[7] base



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Roger D. Peng
Extracting rows from data frames is tricky, since each of the columns could be 
of a different class.  For your toy example, it seems a matrix would be a more 
reasonable option.

R-devel has some improvements to row extraction, if I remember correctly.  You 
might want to try your example there.

-roger


-- 
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Greg Snow
Your 2 examples have 2 differences and they are therefore confounded in
their effects.

What are your results for:

system.time(for (i in 1:100) { row <- dat[i, ] })



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
 
 





Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Wolfgang Huber

Hi Hervé

depending on your problem, using mapply might help, as in the code 
example below:

a = data.frame(matrix(1:3e4, ncol=3))

print(system.time({
r1 = numeric(nrow(a))
for(i in seq_len(nrow(a))) {
   g = a[i,]
   r1[i] = mean(c(g$X1, g$X2, g$X3))
}}))

print(system.time({
f = function(X1,X2,X3) mean(c(X1, X2, X3))
r2 = do.call(mapply, args=append(f, a))
}))

print(identical(r1, r2))

#    user  system elapsed
#   6.049   0.200   6.987
#    user  system elapsed
#   0.508   0.000   0.509
# [1] TRUE

  Best wishes
   Wolfgang


--
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber



[Rd] Patch for format.pval limitation in format.R

2007-03-02 Thread Charles Dupont

'format.pval' has a major limitation in its implementation. For example,
suppose a person had a vector like 'a' below, with the error being ±0.001.

 a <- c(0.1, 0.3, 0.4, 0.5, 0.3, 0.0001)
 format.pval(a, eps=0.001)

The person wants to have the 'format.pval' output with 2 digits always
showing, like this:

[1] 0.10   0.30   0.40   0.50   0.30   <0.001

However, 'format.pval' can only display this:

[1] 0.1    0.3    0.4    0.5    0.3    <0.001

If this was the 'format' function this could be corrected by setting the
'nsmall' argument to 2.  But 'format.pval' has no ability to pass
arguments to format.
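As an aside, plain `format` already produces the desired padding via `nsmall`;
a small sketch of the behaviour that a `...` argument would let `format.pval`
forward:

```r
a <- c(0.1, 0.3, 0.4, 0.5, 0.3, 0.0001)

## format() alone can force two digits after the decimal point:
format(a[1:5], nsmall = 2)   # "0.10" "0.30" "0.40" "0.50" "0.30"

## but format.pval() currently has no way to pass nsmall through to
## format(), which is exactly the gap described here.
```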


I think that the best solution would be to give 'format.pval' a '...'
argument that would get passed to all the 'format' function calls in
'format.pval'.

I have attached a patch that does this.  This patch is against svn
r-release-branch, but it also works with r-devel.


Charles Dupont
--
Charles Dupont  Computer System Analyst School of Medicine
Department of Biostatistics Vanderbilt University

Index: src/library/base/R/format.R
===
--- src/library/base/R/format.R	(revision 40768)
+++ src/library/base/R/format.R	(working copy)
@@ -43,7 +43,7 @@
 }
 
 format.pval <- function(pv, digits = max(1, getOption("digits") - 2),
-			eps = .Machine$double.eps, na.form = "NA")
+			eps = .Machine$double.eps, na.form = "NA", ...)
 {
 ## Format  P values; auxiliary for print.summary.[g]lm(.)
 
@@ -55,8 +55,8 @@
 	## be smart -- differ for fixp. and expon. display:
 	expo <- floor(log10(ifelse(pv > 0, pv, 1e-50)))
 	fixp <- expo >= -3 | (expo == -4 & digits > 1)
-	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits)
-	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits)
+	if(any( fixp)) rr[ fixp] <- format(pv[ fixp], dig=digits, ...)
+	if(any(!fixp)) rr[!fixp] <- format(pv[!fixp], dig=digits, ...)
 	r[!is0] <- rr
 }
 if(any(is0)) {
@@ -67,7 +67,7 @@
 		digits <- max(1, nc - 7)
 	    sep <- if(digits == 1 && nc <= 6) "" else " "
 	} else sep <- if(digits == 1) "" else " "
-	r[is0] <- paste("<", format(eps, digits=digits), sep = sep)
+	r[is0] <- paste("<", format(eps, digits=digits, ...), sep = sep)
 }
 if(has.na) { ## rarely
 	rok - r
Index: src/library/base/man/format.pval.Rd
===
--- src/library/base/man/format.pval.Rd	(revision 40768)
+++ src/library/base/man/format.pval.Rd	(working copy)
@@ -6,13 +6,14 @@
 \alias{format.pval}
 \usage{
 format.pval(pv, digits = max(1, getOption("digits") - 2),
-eps = .Machine$double.eps, na.form = "NA")
+eps = .Machine$double.eps, na.form = "NA", \dots)
 }
 \arguments{
   \item{pv}{a numeric vector.}
   \item{digits}{how many significant digits are to be used.}
   \item{eps}{a numerical tolerance: see Details.}
   \item{na.form}{character representation of \code{NA}s.}
+  \item{\dots}{arguments passed to the \code{\link{format}} function.}
 }
 \value{
   A character vector.



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns
> could be of a different class.  For your toy example, it seems a matrix
> would be a more reasonable option.

There is no doubt about this ;-)

   mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
   dat <- as.data.frame(mat)

With the matrix:

   system.time(for (i in 1:100) { row <- mat[i, ] })
 user  system elapsed
0   0   0

With the data frame:

   system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   12.565   0.296  12.859


And even with a mixed-type data frame, it's very tempting to convert it
to a matrix before doing any looping on it:

   dat2 <- as.data.frame(mat, stringsAsFactors=FALSE)
   dat2 <- cbind(dat2, ii=1:300000)
   sapply(dat2, typeof)
            V1          V2          V3          V4          V5          ii
   "character" "character" "character" "character" "character"   "integer"

   system.time(for (key in row.names(dat2)[1:100]) { row <- dat2[key, ] })
 user  system elapsed
   13.201   0.144  13.360

   system.time({mat2 <- as.matrix(dat2); for (i in 1:100) { row <- mat2[i, ] }})
 user  system elapsed
0.128   0.036   0.163

Big win isn't it? (only if you have enough memory for it though...)

Cheers,
H.



 


Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Ulf Martin
Here is an even faster one; the general point is to create a properly
vectorized custom function/expression:

mymean <- function(x, y, z) (x+y+z)/3

a = data.frame(matrix(1:3e4, ncol=3))
attach(a)
print(system.time({r3 = mymean(X1,X2,X3)}))
detach(a)

# Yields:
# [1] 0.000 0.010 0.005 0.000 0.000

print(identical(r2, r3))
# [1] TRUE

# My values for versions 1 and 2 resp. were
# time for r1:
[1] 29.420 23.090 60.093  0.000  0.000

# time for r2:
[1] 1.400 0.050 1.505 0.000 0.000

Best wishes
Ulf


P.S. A somewhat more meaningful comparison of version 2 and 3:

a = data.frame(matrix(1:3e5, ncol=3))
# time r2e5:
[1] 12.04  0.15 12.92  0.00  0.00

# time r3e5:
[1] 0.030 0.020 0.051 0.000 0.000

> depending on your problem, using mapply might help, as in the code
> example below:
>
> a = data.frame(matrix(1:3e4, ncol=3))
>
> print(system.time({
> r1 = numeric(nrow(a))
> for(i in seq_len(nrow(a))) {
>    g = a[i,]
>    r1[i] = mean(c(g$X1, g$X2, g$X3))
> }}))
>
> print(system.time({
> f = function(X1,X2,X3) mean(c(X1, X2, X3))
> r2 = do.call(mapply, args=append(f, a))
> }))
>
> print(identical(r1, r2))
>
> #    user  system elapsed
> #   6.049   0.200   6.987
> #    user  system elapsed
> #   0.508   0.000   0.509
> # [1] TRUE
>
>   Best wishes
>    Wolfgang
 

 




Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Ulf Martin wrote:
> Here is an even faster one; the general point is to create a properly
> vectorized custom function/expression:
>
> mymean <- function(x, y, z) (x+y+z)/3
>
> a = data.frame(matrix(1:3e4, ncol=3))
> attach(a)
> print(system.time({r3 = mymean(X1,X2,X3)}))
> detach(a)
>
> # Yields:
> # [1] 0.000 0.010 0.005 0.000 0.000

Very fast indeed! And you don't need the attach/detach trick to make your point
since it is (almost) as fast without it:

  a = data.frame(matrix(1:3e4, ncol=3))
  print(system.time({r3 = mymean(a$X1,a$X2,a$X3)}))

However, you are lucky here because in this example (the mean example), you can
use vectorized arithmetic, which is of course very fast.
What about the general case? Unfortunately, situations where you can properly
vectorize tend to be much more frequent in tutorials and demos than in the
real world. Maybe the mean example is a little bit too specific to answer the
general question of what's the best way to _efficiently_ step through a data
frame row by row.

Cheers,
H.





Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Herve Pages
Hi Greg,

Greg Snow wrote:
> Your 2 examples have 2 differences and they are therefore confounded in
> their effects.
>
> What are your results for:
>
> system.time(for (i in 1:100) { row <- dat[i, ] })

Right. What you suggest is even faster (and more simple):

   mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
   dat <- as.data.frame(mat)

   system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
 user  system elapsed
   13.241   0.460  13.702

   system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
 user  system elapsed
0.280   0.372   0.650

   system.time(for (i in 1:100) { row <- dat[i, ] })
 user  system elapsed
0.044   0.088   0.130

So apparently here extracting with dat[i, ] is 300 times faster than
extracting with dat[key, ] !

> system.time(for (i in 1:100) dat["1", ])
   user  system elapsed
 12.680   0.396  13.075

> system.time(for (i in 1:100) dat[1, ])
   user  system elapsed
  0.060   0.076   0.137

Good to know!

Thanks a lot,
H.



Re: [Rd] extracting rows from a data frame by looping over the row names: performance issues

2007-03-02 Thread Seth Falcon
Herve Pages [EMAIL PROTECTED] writes:

> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ] !
>
> system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
>
> system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!

I think what you are seeing here has to do with the space efficient
storage of row.names of a data.frame.  The example data you are
working with has no specified row names and so they get stored in a
compact fashion:

mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)

> typeof(attr(dat, "row.names"))
[1] "integer"

In the call to [.data.frame when i is character, the appropriate index
is found using pmatch and this requires that the row names be
converted to character.  So in a loop, you get to convert the integer
vector to character vector at each iteration.

If you assign character row names, things will be a bit faster:

# before
system.time(for (i in 1:25) dat["2", ])
   user  system elapsed
  9.337   0.404  10.731

# this looks funny, but has the desired result
rownames(dat) <- rownames(dat)
typeof(attr(dat, "row.names"))

# after
system.time(for (i in 1:25) dat["2", ])
   user  system elapsed
  0.343   0.226   0.608

And you probably would have seen this if you had looked at the
profiling data:

Rprof()
for (i in 1:25) dat[2, ]
Rprof(NULL)
summaryRprof()


+ seth
