Re: [Rd] [datatable-help] speeding up perception

2011-07-12 Thread Matthew Dowle
 Matthew,

 I was hoping I misunderstood your first proposal, but I suspect I did not
 ;).

 Personally, I find  DT[1,V1 <- 3] highly disturbing - I would expect it to
 evaluate to
 { V1 <- 3; DT[1, V1] }
 thus returning the first element of the third column.

Please see FAQ 1.1, since further below it seems to be an expectation
issue about 'with' syntax, too.


 That said, I don't think it works, either. Taking your example and
 data.table from r-forge:
[ snip ]
 as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce
the non-working state I'll need some more environment information please.

 Also I suspect there is something quite amiss because even trivial things
 don't work:

 DF[1:4,1:4]
   V1 V2 V3 V4
 1  3  1  1  1
 2  1  1  1  1
 3  1  1  1  1
 4  1  1  1  1
 DT[1:4,1:4]
 [1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
and 1.10.


 When I first saw your proposal, I thought you rather had something like
 within(DT, V1[1] <- 3)
 in mind, which looks innocent enough but performs terribly (note that I had
 to scale down the loop by a factor of 100!!!):

 system.time(for (i in 1:10) within(DT, V1[1] <- 3))
user  system elapsed
   2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of
building 'within' in, too. I'll take a look at within(). Might as well
provide as many options as possible to the user to use as they wish.

 With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3)
 performs reasonably:

 system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
user  system elapsed
   0.392   0.613   1.003

 (Note: system.time() can be misleading when within() is involved, because
 the expression is evaluated in a different environment so within() won't
 actually change the object in the  global environment - it also interacts
 with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the
original issue Ivo raised, then?  If so, job done.
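[Editor's note: a minimal runnable sketch of the within() approach being discussed; the data.frame DF and column a here are illustrative, not the thread's benchmark objects.]

```r
# within() evaluates the expression in an environment built from the
# data.frame's columns, then returns the modified object; the result
# must be reassigned, since the caller's object is not touched.
DF <- data.frame(a = 1:5, b = 1:5)
DF <- within(DF, a[1] <- 42L)
DF$a[1]  # now 42
```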


 Cheers,
 Simon

 On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

 Thanks for the replies and info. An attempt at fast
 assign is now committed to data.table v1.6.3 on
 R-Forge. From NEWS :

 o   Fast update is now implemented, FR#200.
DT[i,j]<-value is now handled by data.table in C rather
than falling through to data.frame methods.

Thanks to Ivo Welch for raising speed issues on r-devel,
to Simon Urbanek for the suggestion, and Luke Tierney and
Simon for information on R internals.

[<- syntax still incurs one working copy of the whole
table (as of R 2.13.0) due to R's [<- dispatch mechanism
copying to `*tmp*`, so, for ultimate speed and brevity,
'within' syntax is now available as follows.

 o   A new 'within' argument has been added to [.data.table,
by default TRUE. It is very similar to the within()
function in base R. If an assignment appears in j, it
assigns to the column of DT, by reference; e.g.,

DT[i,colname<-value]

This syntax makes no copies of any part of memory at all.

 m = matrix(1,nrow=10,ncol=100)
 DF = as.data.frame(m)
 DT = as.data.table(m)
 system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed
287.730 323.196 613.453
 system.time(for (i in 1:1000) DT[1,V1 <- 3])
   user  system elapsed
  1.152   0.004   1.161 # 528 times faster

 Please note :

***
**  Within syntax is presently highly experimental.  **
***

 http://datatable.r-forge.r-project.org/


 On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
 On Wed, 6 Jul 2011, Simon Urbanek wrote:

 Interesting, and I stand corrected:

 x = data.frame(a=1:n,b=1:n)
 .Internal(inspect(x))
 @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[1,1]=42L
 .Internal(inspect(x))
 @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
 @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
 @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
 @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...


 I have R to release ;) so I won't be looking into this right now, but
 it's something worth investigating ... Since all the inner contents
 have NAMED=0 I would not expect any duplication to be needed, but
 apparently it becomes so at some point ...


 The internals 

Re: [Rd] [datatable-help] speeding up perception

2011-07-12 Thread Matthew Dowle
Thanks for the replies and info. An attempt at fast
assign is now committed to data.table v1.6.3 on
R-Forge. From NEWS :

o   Fast update is now implemented, FR#200.
DT[i,j]<-value is now handled by data.table in C rather
than falling through to data.frame methods.

Thanks to Ivo Welch for raising speed issues on r-devel,
to Simon Urbanek for the suggestion, and Luke Tierney and
Simon for information on R internals.

[<- syntax still incurs one working copy of the whole
table (as of R 2.13.0) due to R's [<- dispatch mechanism
copying to `*tmp*`, so, for ultimate speed and brevity,
'within' syntax is now available as follows.

o   A new 'within' argument has been added to [.data.table,
by default TRUE. It is very similar to the within()
function in base R. If an assignment appears in j, it
assigns to the column of DT, by reference; e.g.,
 
DT[i,colname<-value]

This syntax makes no copies of any part of memory at all.

 m = matrix(1,nrow=10,ncol=100)
 DF = as.data.frame(m)
 DT = as.data.table(m)
 system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed 
287.730 323.196 613.453 
 system.time(for (i in 1:1000) DT[1,V1 <- 3])
   user  system elapsed 
  1.152   0.004   1.161 # 528 times faster

Please note :

***
**  Within syntax is presently highly experimental.  **
***

http://datatable.r-forge.r-project.org/


On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
 On Wed, 6 Jul 2011, Simon Urbanek wrote:
 
  Interesting, and I stand corrected:
 
  x = data.frame(a=1:n,b=1:n)
  .Internal(inspect(x))
  @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
   @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
   @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
  x[1,1]=42L
  .Internal(inspect(x))
  @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
   @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
   @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
  x[[1]][1]=42L
  .Internal(inspect(x))
  @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
   @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
   @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
 
  x[[1]][1]=42L
  .Internal(inspect(x))
  @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
   @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
   @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
 
  I have R to release ;) so I won't be looking into this right now, but it's 
  something worth investigating ... Since all the inner contents have NAMED=0 
  I would not expect any duplication to be needed, but apparently it becomes 
  so at some point ...
 
 
 The internals assume in various places that deep copies are made (one
 of the reasons NAMED settings are not propagated to sub-structures).
 The main issues are avoiding cycles and that there is no easy way to
 check for sharing.  There may be some circumstances in which a shallow
 copy would be OK but making sure it would be in all cases is probably
 more trouble than it is worth at this point. (I've tried this in the
 past in a few cases and always had to back off.)
 
 
 Best,
 
 luke
 
 
  Cheers,
  Simon
 
 
  On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
 
 
  On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
  No subassignment function satisfies that condition, because you can 
  always call them directly. However, that doesn't stop the default method 
  from making that assumption, so I'm not sure it's an issue.
 
  David, Just to clarify - the data frame content is not copied, we are 
  talking about the vector holding columns.
 
  If it is just the vector holding the columns that is copied (and not the
  columns themselves), why does n make a difference in this test (on R
  2.13.0)?
 
  n = 1000
  x = data.frame(a=1:n,b=1:n)
   system.time(for (i in 1:1000) x[1,1] <- 42L)
user  system elapsed
   0.628   0.000   0.628
  n = 10
  x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
   system.time(for (i in 1:1000) x[1,1] <- 42L)
user  system elapsed
  20.145   1.232  21.455
 
 
   With $<- :
 
  n = 1000
  x = data.frame(a=1:n,b=1:n)
   system.time(for (i in 1:1000) x$a[1] <- 42L)
user  system elapsed
   0.304   0.000   0.307
  n = 10
  x = data.frame(a=1:n,b=1:n)
   system.time(for (i in 1:1000) x$a[1] <- 42L)
user  system elapsed
  37.586   0.388  38.161
 
 
  If it's because the 1st column needs to be copied (only) because that's
  the one being assigned to (in this test), that magnitude of slow down
  doesn't seem consistent with the time of a vector copy of the 1st
  column :
 
  n=10
  v = 1:n
   system.time(for (i in 1:1000) v[1] <- 42L)
user  

Re: [Rd] [datatable-help] speeding up perception

2011-07-12 Thread Matthew Dowle

Simon,
If you didn't install.packages() with type="source" from R-Forge, that
would explain (some of) it. R-Forge builds binaries once each night. This
commit was long after the cutoff.
Matthew

 Matthew,

 I was hoping I misunderstood your first proposal, but I suspect I did not
 ;).

 Personally, I find  DT[1,V1 <- 3] highly disturbing - I would expect it
 to
 evaluate to
 { V1 <- 3; DT[1, V1] }
 thus returning the first element of the third column.

 Please see FAQ 1.1, since further below it seems to be an expectation
 issue about 'with' syntax, too.


 That said, I don't think it works, either. Taking your example and
 data.table from r-forge:
 [ snip ]
 as you can see, DT is not modified.

 Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce
 the non-working state I'll need some more environment information please.

 Also I suspect there is something quite amiss because even trivial
 things
 don't work:

 DF[1:4,1:4]
   V1 V2 V3 V4
 1  3  1  1  1
 2  1  1  1  1
 3  1  1  1  1
 4  1  1  1  1
 DT[1:4,1:4]
 [1] 1 2 3 4

 That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
 and 1.10.


 When I first saw your proposal, I thought you rather had something like
 within(DT, V1[1] <- 3)
 in mind, which looks innocent enough but performs terribly (note that I
 had
 to scale down the loop by a factor of 100!!!):

 system.time(for (i in 1:10) within(DT, V1[1] <- 3))
user  system elapsed
   2.701   4.437   7.138

 No, since 'with' is already built into data.table, I was thinking of
 building 'within' in, too. I'll take a look at within(). Might as well
 provide as many options as possible to the user to use as they wish.

 With the for loop something like within(DF, for (i in 1:1000) V1[i] <-
 3)
 performs reasonably:

 system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
user  system elapsed
   0.392   0.613   1.003

 (Note: system.time() can be misleading when within() is involved,
 because
 the expression is evaluated in a different environment so within() won't
 actually change the object in the  global environment - it also
 interacts
 with the possible duplication)

 Noted, thanks. That's pretty fast. Does within() on data.frame fix the
 original issue Ivo raised, then?  If so, job done.


 Cheers,
 Simon

 On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

 Thanks for the replies and info. An attempt at fast
 assign is now committed to data.table v1.6.3 on
 R-Forge. From NEWS :

 o   Fast update is now implemented, FR#200.
DT[i,j]<-value is now handled by data.table in C rather
than falling through to data.frame methods.

Thanks to Ivo Welch for raising speed issues on r-devel,
to Simon Urbanek for the suggestion, and Luke Tierney and
Simon for information on R internals.

[<- syntax still incurs one working copy of the whole
table (as of R 2.13.0) due to R's [<- dispatch mechanism
copying to `*tmp*`, so, for ultimate speed and brevity,
'within' syntax is now available as follows.

 o   A new 'within' argument has been added to [.data.table,
by default TRUE. It is very similar to the within()
function in base R. If an assignment appears in j, it
assigns to the column of DT, by reference; e.g.,

DT[i,colname<-value]

This syntax makes no copies of any part of memory at all.

 m = matrix(1,nrow=10,ncol=100)
 DF = as.data.frame(m)
 DT = as.data.table(m)
 system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed
287.730 323.196 613.453
 system.time(for (i in 1:1000) DT[1,V1 <- 3])
   user  system elapsed
  1.152   0.004   1.161 # 528 times faster

 Please note :

***
**  Within syntax is presently highly experimental.  **
***

 http://datatable.r-forge.r-project.org/


 On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
 On Wed, 6 Jul 2011, Simon Urbanek wrote:

 Interesting, and I stand corrected:

 x = data.frame(a=1:n,b=1:n)
 .Internal(inspect(x))
 @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[1,1]=42L
 .Internal(inspect(x))
 @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
 @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
 @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
 @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...


 I have R to release ;) so I won't be looking into this 

Re: [Rd] [datatable-help] speeding up perception

2011-07-12 Thread Simon Urbanek
On Jul 12, 2011, at 6:24 AM, Matthew Dowle wrote:

 Matthew,
 
 I was hoping I misunderstood your first proposal, but I suspect I did not
 ;).
 
 Personally, I find  DT[1,V1 <- 3] highly disturbing - I would expect it to
 evaluate to
 { V1 <- 3; DT[1, V1] }
 thus returning the first element of the third column.
 
 Please see FAQ 1.1, since further below it seems to be an expectation
 issue about 'with' syntax, too.
 

Just to clarify - the NEWS has led me to believe that the destructive DT[i, x 
<- y] syntax is new. That is what my objection is about. I'm fine with 
subsetting operators working on expressions, but I'm not happy with subsetting 
operators modifying the object they are subsetting - since it's subsetting, 
not subassignment - that's what I was referring to.



 That said, I don't think it works, either. Taking your example and
 data.table from r-forge:
 [ snip ]
 as you can see, DT is not modified.
 
 Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce
 the non-working state I'll need some more environment information please.
 

The issue persist on several machines I tested - including R 2.13.0:

 sessionInfo()
R version 2.13.0 Patched (2011-05-15 r55914)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] data.table_1.6.3


 sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu/amd64 (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C 
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] data.table_1.6.3
 DT = as.data.table(m)
 for (i in 1:1000) DT[1,V1 <- 3]
 DT[1,]
 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1



 Also I suspect there is something quite amiss because even trivial things
 don't work:
 
 DF[1:4,1:4]
  V1 V2 V3 V4
 1  3  1  1  1
 2  1  1  1  1
 3  1  1  1  1
 4  1  1  1  1
 DT[1:4,1:4]
 [1] 1 2 3 4
 
 That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
 and 1.10.
 

Fair enough, I expected data.table to be a drop-in replacement for data.frame - 
I just wanted to check the values. Apparently it's not, by design, hence my 
assumption was wrong.


 
 When I first saw your proposal, I thought you rather had something like
 within(DT, V1[1] <- 3)
 in mind, which looks innocent enough but performs terribly (note that I had
 to scale down the loop by a factor of 100!!!):
 
 system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138
 
 No, since 'with' is already built into data.table, I was thinking of
 building 'within' in, too. I'll take a look at within(). Might as well
 provide as many options as possible to the user to use as they wish.
 
 With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3)
 performs reasonably:
 
 system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003
 
 (Note: system.time() can be misleading when within() is involved, because
 the expression is evaluated in a different environment so within() won't
 actually change the object in the  global environment - it also interacts
 with the possible duplication)
 
 Noted, thanks. That's pretty fast. Does within() on data.frame fix the
 original issue Ivo raised, then?  If so, job done.
 

I don't think so - at least not in the strict sense of no copies (more digging 
may be needed, though, since it does so in system.time, possibly due to the 
NAMED value of the forced promise, but I did not check). However, it allows the 
modification to be expressed inside the expression, which saves the global copy 
and is thus faster than the outside loop.

Cheers,
Simon



 
 Cheers,
 Simon
 
 On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
 
 Thanks for the replies and info. An attempt at fast
 assign is now committed to data.table v1.6.3 on
 R-Forge. From NEWS :
 
 o   Fast update is now implemented, FR#200.
   DT[i,j]<-value is now handled by data.table in C rather
   than falling through to data.frame methods.
 
   Thanks to Ivo Welch for raising speed issues on r-devel,
   to Simon Urbanek for the suggestion, and Luke Tierney and
   Simon for information on R internals.
 
   [<- syntax still incurs one working copy of the whole
   table (as of R 2.13.0) due to R's [<- dispatch mechanism
   copying to `*tmp*`, so, for ultimate speed and brevity,
   'within' syntax is now 

Re: [Rd] [datatable-help] speeding up perception

2011-07-11 Thread Simon Urbanek
Matthew,

I was hoping I misunderstood your first proposal, but I suspect I did not ;).

Personally, I find  DT[1,V1 <- 3] highly disturbing - I would expect it to 
evaluate to
{ V1 <- 3; DT[1, V1] }
thus returning the first element of the third column.

I do understand that within(foo, expr, ...) was the motivation for passing 
expressions, but unlike within() the subsetting operator [ is not expected to 
take an expression as its second argument. Such abuse is quite unexpected and I 
would say dangerous.

That said, I don't think it works, either. Taking your example and data.table 
from r-forge:

 m = matrix(1,nrow=10,ncol=100)
 DF = as.data.frame(m)
 DT = as.data.table(m)
 for (i in 1:1000) DT[1,V1 <- 3]
 DT[1,]
 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

as you can see, DT is not modified.

Also I suspect there is something quite amiss because even trivial things don't 
work:

 DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
 DT[1:4,1:4]
[1] 1 2 3 4


When I first saw your proposal, I thought you rather had something like
within(DT, V1[1] <- 3)
in mind, which looks innocent enough but performs terribly (note that I had to 
scale down the loop by a factor of 100!!!):

 system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed 
  2.701   4.437   7.138 

With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) 
performs reasonably:

 system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed 
  0.392   0.613   1.003 

(Note: system.time() can be misleading when within() is involved, because the 
expression is evaluated in a different environment so within() won't actually 
change the object in the  global environment - it also interacts with the 
possible duplication)

Cheers,
Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

 Thanks for the replies and info. An attempt at fast
 assign is now committed to data.table v1.6.3 on
 R-Forge. From NEWS :
 
 o   Fast update is now implemented, FR#200.
DT[i,j]<-value is now handled by data.table in C rather
than falling through to data.frame methods.
 
Thanks to Ivo Welch for raising speed issues on r-devel,
to Simon Urbanek for the suggestion, and Luke Tierney and
Simon for information on R internals.
 
[<- syntax still incurs one working copy of the whole
table (as of R 2.13.0) due to R's [<- dispatch mechanism
copying to `*tmp*`, so, for ultimate speed and brevity,
'within' syntax is now available as follows.
 
 o   A new 'within' argument has been added to [.data.table,
by default TRUE. It is very similar to the within()
function in base R. If an assignment appears in j, it
assigns to the column of DT, by reference; e.g.,
 
DT[i,colname<-value]
 
This syntax makes no copies of any part of memory at all.
 
 m = matrix(1,nrow=10,ncol=100)
 DF = as.data.frame(m)
 DT = as.data.table(m)
 system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed 
287.730 323.196 613.453 
 system.time(for (i in 1:1000) DT[1,V1 <- 3])
   user  system elapsed 
  1.152   0.004   1.161 # 528 times faster
 
 Please note :
 
***
**  Within syntax is presently highly experimental.  **
***
 
 http://datatable.r-forge.r-project.org/
 
 
 On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
 On Wed, 6 Jul 2011, Simon Urbanek wrote:
 
 Interesting, and I stand corrected:
 
 x = data.frame(a=1:n,b=1:n)
 .Internal(inspect(x))
 @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
 x[1,1]=42L
 .Internal(inspect(x))
 @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
 x[[1]][1]=42L
 .Internal(inspect(x))
 @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
 @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
 
 x[[1]][1]=42L
 .Internal(inspect(x))
 @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
 @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
 
 
 I have R to release ;) so I won't be looking into this right now, but it's 
 something worth investigating ... Since all the inner contents have NAMED=0 
 I would not expect any duplication to be needed, but apparently it becomes 
 so at some point ...
 
 
 The internals assume in various places that deep copies are made (one
 of the reasons NAMED settings are not propagated to sub-structures).
 The main 

Re: [Rd] [datatable-help] speeding up perception

2011-07-06 Thread Matthew Dowle

On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
 No subassignment function satisfies that condition, because you can always 
 call them directly. However, that doesn't stop the default method from making 
 that assumption, so I'm not sure it's an issue.
 
 David, Just to clarify - the data frame content is not copied, we are talking 
 about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the
columns themselves), why does n make a difference in this test (on R
2.13.0)?

 n = 1000
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
  0.628   0.000   0.628 
 n = 10
 x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
 20.145   1.232  21.455 
 

With $<- :

 n = 1000
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed 
  0.304   0.000   0.307 
 n = 10
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed 
 37.586   0.388  38.161 
 

If it's because the 1st column needs to be copied (only) because that's
the one being assigned to (in this test), that magnitude of slow down
doesn't seem consistent with the time of a vector copy of the 1st
column : 

 n=10
 v = 1:n
 system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed 
  0.016   0.000   0.017 
 system.time(for (i in 1:1000) {v2=v; v2[1] <- 42L})
   user  system elapsed 
  1.816   1.076   2.900
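[Editor's note: the {v2=v; ...} timing above is measuring R's copy-on-write behaviour; a minimal sketch of those semantics, with illustrative values.]

```r
# With a single name bound to a vector, subassignment can modify it in
# place; once a second name shares it, R duplicates before writing so
# the other name's value is preserved.
v <- 1:5
v[1] <- 42L   # one reference: no full copy needed
v2 <- v       # v and v2 now conceptually share the vector
v2[1] <- 0L   # duplication happens here, so v is untouched
```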

Finally, increasing the number of columns, again only the 1st is
assigned to :

 n=10
 x = data.frame(rep(list(1:n),100))
 dim(x)
[1]  10 100
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
167.974  50.903 219.711 
 



 
 Cheers,
 Simon
 
 Sent from my iPhone
 
 On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:
 
  
  On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu 
  luke-tier...@uiowa.edu wrote:
  
  On Tue, 5 Jul 2011, Simon Urbanek wrote:
  
  
  On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
  
  Simon (and all),
  
   I've tried to make assignment as fast as calling `[<-.data.table`
   directly, for user convenience. Profiling shows (IIUC) that it isn't
   dispatch, but x being copied. Is there a way to prevent '[<-' from
   copying x?
  
  Good point, and conceptually, no. It's a subassignment after all - see 
  R-lang 3.4.4 - it is equivalent to
  
   `*tmp*` <- x
   x <- `[<-`(`*tmp*`, i, j, value)
   rm(`*tmp*`)
  
  so there is always a copy involved.
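[Editor's note: the expansion above can be run literally; a sketch with a small illustrative data.frame, where x and the indices 1, 1 stand in for the general i, j.]

```r
# What x[1, 1] <- 42L desugars to, per R-lang 3.4.4: copy the object
# into `*tmp*`, call the subassignment function on it, rebind x, and
# remove the temporary.
x <- data.frame(a = 1:3, b = 1:3)
`*tmp*` <- x
x <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)
x$a[1]  # 42
```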
  
  Now, a conceptual copy doesn't mean real copy in R since R tries to keep 
  the pass-by-value illusion while passing references in cases where it 
  knows that modifications cannot occur and/or they are safe. The default 
  subassign method uses that feature which means it can afford to not 
  duplicate if there is only one reference -- then it's safe to not 
  duplicate as we are replacing that only existing reference. And in the 
  case of a matrix, that will be true at the latest from the second 
  subassignment on.
  
  Unfortunately the method dispatch (AFAICS) introduces one more reference 
  in the dispatch chain so there will always be two references so 
  duplication is necessary. Since we have only 0 / 1 / 2+ information on 
  the references, we can't distinguish whether the second reference is due 
  to the dispatch or due to the passed object having more than one 
  reference, so we have to duplicate in any case. That is unfortunate, and 
  I don't see a way around (unless we handle subassignment methods is some 
  special way).
  
  I don't believe dispatch is bumping NAMED (and a quick experiment
  seems to confirm this though I don't guarantee I did that right). The
  issue is that a replacement function implemented as a closure, which
  is the only option for a package, will always see NAMED on the object
  to be modified as 2 (because the value is obtained by forcing the
  argument promise) and so any R level assignments will duplicate.  This
  also isn't really an issue of imprecise reference counting -- there
  really are (at least) two legitimate references -- one though the
  argument and one through the caller's environment.
  
   It would be good if we could come up with a way for packages to be
   able to define replacement functions that do not duplicate in cases
   where we really don't want them to, but this would require coming up
   with some sort of protocol, minimally involving an efficient way to
   detect whether a replacement function is being called in a replacement
   context or directly.
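[Editor's note: a sketch of the kind of closure-based replacement function being described; `first<-` is an illustrative name, not an existing API. Because the promise for x is forced inside the closure, the object appears shared and the inner assignment duplicates it.]

```r
# A replacement function defined in R code, as a package must define it.
`first<-` <- function(x, value) {
  x[1] <- value  # x is seen as having multiple references, so this copies
  x
}
v <- 1:5
first(v) <- 42L  # sugar for: v <- `first<-`(v, value = 42L)
```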
  
   Would $<- always satisfy that condition? It would be a big help to me if it 
   could be designed to avoid duplicating the rest of the data.frame.
  
  -- 
  
  
  There are some replacement functions that use C code to cheat, but
  these may create problems if called directly, so I won't advertise
  them.
  
  Best,
  
  luke
  
  
  Cheers,
  

Re: [Rd] [datatable-help] speeding up perception

2011-07-06 Thread Simon Urbanek
Interesting, and I stand corrected:

 x = data.frame(a=1:n,b=1:n)
 .Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[1,1]=42L
 .Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...

 x[[1]][1]=42L
 .Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...


I have R to release ;) so I won't be looking into this right now, but it's 
something worth investigating ... Since all the inner contents have NAMED=0 I 
would not expect any duplication to be needed, but apparently it becomes so at 
some point ...

Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

 
 On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
 No subassignment function satisfies that condition, because you can always 
 call them directly. However, that doesn't stop the default method from 
 making that assumption, so I'm not sure it's an issue.
 
 David, Just to clarify - the data frame content is not copied, we are 
 talking about the vector holding columns.
 
 If it is just the vector holding the columns that is copied (and not the
 columns themselves), why does n make a difference in this test (on R
 2.13.0)?
 
 n = 1000
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
  0.628   0.000   0.628 
 n = 10
 x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
 20.145   1.232  21.455 
 
 
 With $<- :
 
 n = 1000
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed 
  0.304   0.000   0.307 
 n = 10
 x = data.frame(a=1:n,b=1:n)
 system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed 
 37.586   0.388  38.161 
 
 
 If it's because the 1st column needs to be copied (only) because that's
 the one being assigned to (in this test), that magnitude of slow down
 doesn't seem consistent with the time of a vector copy of the 1st
 column : 
 
 n=10
 v = 1:n
 system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed 
  0.016   0.000   0.017 
 system.time(for (i in 1:1000) {v2=v;v2[1] - 42L})
   user  system elapsed 
  1.816   1.076   2.900
 
 Finally, increasing the number of columns, again only the 1st is
 assigned to :
 
 n=100000
 x = data.frame(rep(list(1:n),100))
 dim(x)
 [1] 100000    100
 system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed 
 167.974  50.903 219.711 
 
 
 
 
 
 Cheers,
 Simon
 
 Sent from my iPhone
 
 On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:
 
 
 On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu 
 luke-tier...@uiowa.edu wrote:
 
 On Tue, 5 Jul 2011, Simon Urbanek wrote:
 
 
 On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
 
 Simon (and all),
 
 I've tried to make assignment as fast as calling `[<-.data.table`
 directly, for user convenience. Profiling shows (IIUC) that it isn't
 dispatch, but x being copied. Is there a way to prevent '[<-' from
 copying x?
 
 Good point, and conceptually, no. It's a subassignment after all - see 
 R-lang 3.4.4 - it is equivalent to
 
 `*tmp*` <- x
 x <- `[<-`(`*tmp*`, i, j, value)
 rm(`*tmp*`)
 
 so there is always a copy involved.
 
 Now, a conceptual copy doesn't mean real copy in R since R tries to keep 
 the pass-by-value illusion while passing references in cases where it 
 knows that modifications cannot occur and/or they are safe. The default 
 subassign method uses that feature which means it can afford to not 
 duplicate if there is only one reference -- then it's safe to not 
 duplicate as we are replacing that only existing reference. And in the 
 case of a matrix, that will be true at the latest from the second 
 subassignment on.
 
 Unfortunately the method dispatch (AFAICS) introduces one more reference 
 in the dispatch chain so there will always be two references so 
 duplication is necessary. Since we have only 0 / 1 / 2+ information on 
 the references, we can't distinguish whether the second reference is due 
 to the dispatch or due to the passed object having more than one 
 reference, so we have to duplicate in any case. That is unfortunate, and 
 I don't see a way around (unless we handle subassignment methods in some 
 special way).
 
 I don't believe dispatch is bumping NAMED (and a quick experiment
 seems to confirm 

Re: [Rd] [datatable-help] speeding up perception

2011-07-06 Thread luke-tierney

On Wed, 6 Jul 2011, Simon Urbanek wrote:


Interesting, and I stand corrected:


x = data.frame(a=1:n,b=1:n)
.Internal(inspect(x))

@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
 @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


x[1,1]=42L
.Internal(inspect(x))

@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


x[[1]][1]=42L
.Internal(inspect(x))

@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
 @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...


x[[1]][1]=42L
.Internal(inspect(x))

@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


I have R to release ;) so I won't be looking into this right now, but it's 
something worth investigating ... Since all the inner contents have NAMED=0 I 
would not expect any duplication to be needed, but apparently it becomes so at 
some point ...



The internals assume in various places that deep copies are made (one
of the reasons NAMED settings are not propagated to sub-structures).
The main issues are avoiding cycles and that there is no easy way to
check for sharing.  There may be some circumstances in which a shallow
copy would be OK but making sure it would be in all cases is probably
more trouble than it is worth at this point. (I've tried this in the
past in a few cases and always had to back off.)


Best,

luke



Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:



On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call 
them directly. However, that doesn't stop the default method from making that 
assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking 
about the vector holding columns.


If it is just the vector holding the columns that is copied (and not the
columns themselves), why does n make a difference in this test (on R
2.13.0)?


n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
 0.628   0.000   0.628

n = 100000
x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
20.145   1.232  21.455




With $<- :


n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)

  user  system elapsed
 0.304   0.000   0.307

n = 100000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)

  user  system elapsed
37.586   0.388  38.161




If it's because the 1st column needs to be copied (only) because that's
the one being assigned to (in this test), that magnitude of slow down
doesn't seem consistent with the time of a vector copy of the 1st
column :


n=100000
v = 1:n
system.time(for (i in 1:1000) v[1] <- 42L)

  user  system elapsed
 0.016   0.000   0.017

system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})

  user  system elapsed
 1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is
assigned to :


n=100000
x = data.frame(rep(list(1:n),100))
dim(x)

[1] 100000    100

system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
167.974  50.903 219.711








Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:



On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu luke-tier...@uiowa.edu 
wrote:


On Tue, 5 Jul 2011, Simon Urbanek wrote:



On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:


Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?


Good point, and conceptually, no. It's a subassignment after all - see R-lang 
3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the 
pass-by-value illusion while passing references in cases where it knows that 
modifications cannot occur and/or they are safe. The default subassign method 
uses that feature which means it can afford to not duplicate if there is only 
one reference -- then it's safe to not duplicate as we are replacing that only 
existing reference. And in the case of a matrix, that will be true at the 
latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the 
dispatch chain so there will always be two references so duplication is 
necessary. Since we have 

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread Matthew Dowle

Simon,

Thanks for the great suggestion. I've written a skeleton assignment
function for data.table which incurs no copies, which works for this
case. For completeness, if I understand correctly, this is for : 
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed 
 12.792   0.488  13.340 

After :

 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed 
  2.908   0.020   2.935

Where this can be reduced further as follows :

 system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed 
  0.132   0.000   0.131 
 

Still working on it. When it doesn't break other data.table tests, I'll
commit to R-Forge ...

Matthew


On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
 Timothée,
 
 On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
 
  Hi --
  
  It's my first post on this list; as a relatively new user with little
  knowledge of R internals, I am a bit intimidated by the depth of some
  of the discussions here, so please spare me if I say something
  incredibly silly.
  
  I feel that someone at this point should mention Matthew Dowle's
  excellent data.table package
  (http://cran.r-project.org/web/packages/data.table/index.html) which
  seems to me to address many of the inefficiencies of data.frame.
  data.tables have no row names; and operations that only need data from
  one or two columns are (I believe) just as quick whether the total
  number of columns is 5 or 1000. This results in very quick operations
  (and, often, elegant code as well).
  
 
 I agree that data.table is a very good alternative (for other reasons) that 
 should be promoted more. The only slight snag is that it doesn't help with 
 the issue at hand since it simply does a pass-through for subassignments to 
 data frame's methods and thus suffers from the same problems (in fact there 
 is a rather stark asymmetry in how it handles subsetting vs subassignment - 
 which is a bit surprising [if I read the code correctly you can't use the 
 same indexing in both]). In fact I would propose that it should not do that 
 but handle the simple cases itself more efficiently without unneeded copies. 
 That would make it indeed a very interesting alternative.
 
 Cheers,
 Simon
 
 
  
  On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:
  thank you, simon.  this was very interesting indeed.  I also now
  understand how far out of my depth I am here.
  
  fortunately, as an end user, obviously, *I* now know how to avoid the
  problem.  I particularly like the as.list() transformation and back to
  as.data.frame() to speed things up without loss of (much)
  functionality.
  
  
  more broadly, I view the avoidance of individual access through the
  use of apply and vector operations as a mixed IQ test and knowledge
  test (which I often fail).  However, even for the most clever, there
  are also situations where the KISS programming principle makes
  explicit loops still preferable.  Personally, I would have preferred
  it if R had, in its standard statistical data set data structure,
  foregone the row names feature in exchange for retaining fast direct
  access.  R could have reserved its current implementation with row
  names but slow access for a less common (possibly pseudo-inheriting)
  data structure.
  
  
  If end users commonly do iterations over a data frame, which I would
  guess to be the case, then the impression of R by (novice) end users
  could be greatly enhanced if the extreme penalties could be eliminated
  or at least flagged.  For example, I wonder if modest special internal
  code could store data frames internally and transparently as lists of
  vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
  but specific warning message could be issued with a suggestion if
  there is an individual read/write into a data frame (Warning: data
  frames are much slower than lists of vectors for individual element
  access).
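The list-of-vectors workaround being suggested above can be sketched as follows (timings illustrative; the conversion assumes all columns are plain vectors):

```r
## Sketch: individual element writes into a plain list of vectors
## versus the same writes into a data.frame.
n   <- 100000
df  <- data.frame(a = seq_len(n), b = seq_len(n))
lst <- as.list(df)                  # plain list of vectors, no row names
system.time(for (i in 1:1000) df$a[i]  <- 0L)  # slow: copies involved
system.time(for (i in 1:1000) lst$a[i] <- 0L)  # typically much faster
df2 <- as.data.frame(lst)           # convert back when finished
```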
  
  
  I would also suggest changing the Introduction to R 6.3  from A
  data frame may for many purposes be regarded as a matrix with columns
  possibly of differing modes and attributes. It may be displayed in
  matrix form, and its rows and columns extracted using matrix indexing
  conventions. to A data frame may for many purposes be regarded as a
  matrix with columns possibly of differing modes and attributes. It may
  be displayed in matrix form, and its rows and columns extracted using
  matrix indexing conventions.  However, data frames can be much slower
  than matrices or even lists of vectors (which, like data frames, can
  contain different types of columns) when individual elements need to
  be accessed.  Reading about it immediately upon introduction could
  flag the problem in a more visible manner.
  
  
  regards,
  
  /iaw
  
  __
  

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread Matthew Dowle
Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?  Small reproducible example in vanilla R 2.13.0 :

 x = list(a=1:10000,b=1:10000)
 class(x) = "newclass"
 `[<-.newclass` = function(x,i,j,value) x  # i.e. do nothing
 tracemem(x)
[1] "<0xa1ec758>"
 x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:# but, x is still copied, why?
 

I've tried returning NULL from [-.newclass but then x gets assigned
NULL :

 `[<-.newclass` = function(x,i,j,value) NULL
 x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]: 
 x
NULL
 

Any pointers much appreciated. If that copy is preventable it should
save the user needing to use `[<-.data.table`(...) syntax to get the
best speed (20 times faster on the small example used so far).
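As a rough illustration of the two call forms (using data.frame's method as a stand-in, since the data.table skeleton isn't committed yet; names and timings are illustrative only):

```r
## Sketch: the assignment form goes through the `*tmp*` desugaring and
## copies, while calling the replacement method directly skips that step.
## data.frame's method is used here purely as a stand-in.
R  <- 200
DT <- data.frame(a = seq_len(1000), b = seq_len(1000))
system.time(for (r in 1:R) DT[r, 2] <- 1.0)
## direct call: the result must be reassigned to keep the change,
## unlike the by-reference data.table skeleton described above
system.time(for (r in 1:R) DT <- `[<-.data.frame`(DT, r, 2, 1.0))
```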

Matthew


On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:
 Simon,
 
 Thanks for the great suggestion. I've written a skeleton assignment
 function for data.table which incurs no copies, which works for this
 case. For completeness, if I understand correctly, this is for : 
   i) convenience of new users who don't know how to vectorize yet
   ii) more complex examples which can't be vectorized.
 
 Before:
 
  system.time(for (r in 1:R) DT[r,20] <- 1.0)
user  system elapsed 
  12.792   0.488  13.340 
 
 After :
 
  system.time(for (r in 1:R) DT[r,20] <- 1.0)
user  system elapsed 
   2.908   0.020   2.935
 
 Where this can be reduced further as follows :
 
  system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
user  system elapsed 
   0.132   0.000   0.131 
  
 
 Still working on it. When it doesn't break other data.table tests, I'll
 commit to R-Forge ...
 
 Matthew
 
 
 On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
  Timothée,
  
  On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
  
   Hi --
   
   It's my first post on this list; as a relatively new user with little
   knowledge of R internals, I am a bit intimidated by the depth of some
   of the discussions here, so please spare me if I say something
   incredibly silly.
   
   I feel that someone at this point should mention Matthew Dowle's
   excellent data.table package
   (http://cran.r-project.org/web/packages/data.table/index.html) which
   seems to me to address many of the inefficiencies of data.frame.
   data.tables have no row names; and operations that only need data from
   one or two columns are (I believe) just as quick whether the total
   number of columns is 5 or 1000. This results in very quick operations
   (and, often, elegant code as well).
   
  
  I agree that data.table is a very good alternative (for other reasons) that 
  should be promoted more. The only slight snag is that it doesn't help with 
  the issue at hand since it simply does a pass-through for subassignments to 
  data frame's methods and thus suffers from the same problems (in fact there 
  is a rather stark asymmetry in how it handles subsetting vs subassignment - 
  which is a bit surprising [if I read the code correctly you can't use the 
  same indexing in both]). In fact I would propose that it should not do that 
  but handle the simple cases itself more efficiently without unneeded 
  copies. That would make it indeed a very interesting alternative.
  
  Cheers,
  Simon
  
  
   
   On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:
   thank you, simon.  this was very interesting indeed.  I also now
   understand how far out of my depth I am here.
   
   fortunately, as an end user, obviously, *I* now know how to avoid the
   problem.  I particularly like the as.list() transformation and back to
   as.data.frame() to speed things up without loss of (much)
   functionality.
   
   
   more broadly, I view the avoidance of individual access through the
   use of apply and vector operations as a mixed IQ test and knowledge
   test (which I often fail).  However, even for the most clever, there
   are also situations where the KISS programming principle makes
   explicit loops still preferable.  Personally, I would have preferred
   it if R had, in its standard statistical data set data structure,
   foregone the row names feature in exchange for retaining fast direct
   access.  R could have reserved its current implementation with row
   names but slow access for a less common (possibly pseudo-inheriting)
   data structure.
   
   
   If end users commonly do iterations over a data frame, which I would
   guess to be the case, then the impression of R by (novice) end users
   could be greatly enhanced if the extreme penalties could be eliminated
   or at least flagged.  For example, I wonder if modest special internal
   code could store data frames internally and transparently as lists of
   vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
   but specific warning message could be issued with a suggestion if
   there is an 

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread Simon Urbanek

On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

 Simon (and all),
 
 I've tried to make assignment as fast as calling `[<-.data.table`
 directly, for user convenience. Profiling shows (IIUC) that it isn't
 dispatch, but x being copied. Is there a way to prevent '[<-' from
 copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 
3.4.4 - it is equivalent to 

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the 
pass-by-value illusion while passing references in cases where it knows that 
modifications cannot occur and/or they are safe. The default subassign method 
uses that feature which means it can afford to not duplicate if there is only 
one reference -- then it's safe to not duplicate as we are replacing that only 
existing reference. And in the case of a matrix, that will be true at the 
latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the 
dispatch chain so there will always be two references so duplication is 
necessary. Since we have only 0 / 1 / 2+ information on the references, we 
can't distinguish whether the second reference is due to the dispatch or due to 
the passed object having more than one reference, so we have to duplicate in 
any case. That is unfortunate, and I don't see a way around (unless we handle 
subassignment methods in some special way).
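The desugaring described above can be watched happening with tracemem(); a minimal sketch (addresses differ per session, and whether a copy is reported depends on the R version):

```r
## Minimal sketch of the `*tmp*` desugaring making a copy visible.
x <- data.frame(a = 1:5, b = 1:5)
tracemem(x)            # start reporting duplications of x
x[1, 1] <- 42L         # evaluated roughly as:
                       #   `*tmp*` <- x
                       #   x <- `[<-`(`*tmp*`, 1, 1, value = 42L)
                       #   rm(`*tmp*`)
                       # tracemem typically reports a copy here
untracemem(x)
```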

Cheers,
Simon



  Small reproducible example in vanilla R 2.13.0 :
 
 x = list(a=1:10000,b=1:10000)
 class(x) = "newclass"
 `[<-.newclass` = function(x,i,j,value) x  # i.e. do nothing
 tracemem(x)
 [1] "<0xa1ec758>"
 x[1,2] = 42L
 tracemem[0xa1ec758 -> 0xa1ec558]:# but, x is still copied, why?
 
 
 I've tried returning NULL from [-.newclass but then x gets assigned
 NULL :
 
 `[<-.newclass` = function(x,i,j,value) NULL
 x[1,2] = 42L
 tracemem[0xa1ec558 -> 0x9c5f318]: 
 x
 NULL
 
 
 Any pointers much appreciated. If that copy is preventable it should
 save the user needing to use `[<-.data.table`(...) syntax to get the
 best speed (20 times faster on the small example used so far).
 
 Matthew
 
 
 On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:
 Simon,
 
 Thanks for the great suggestion. I've written a skeleton assignment
 function for data.table which incurs no copies, which works for this
 case. For completeness, if I understand correctly, this is for : 
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.
 
 Before:
 
 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed 
 12.792   0.488  13.340 
 
 After :
 
 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed 
  2.908   0.020   2.935
 
 Where this can be reduced further as follows :
 
 system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed 
  0.132   0.000   0.131 
 
 
 Still working on it. When it doesn't break other data.table tests, I'll
 commit to R-Forge ...
 
 Matthew
 
 
 On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
 Timothée,
 
 On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
 
 Hi --
 
 It's my first post on this list; as a relatively new user with little
 knowledge of R internals, I am a bit intimidated by the depth of some
 of the discussions here, so please spare me if I say something
 incredibly silly.
 
 I feel that someone at this point should mention Matthew Dowle's
 excellent data.table package
 (http://cran.r-project.org/web/packages/data.table/index.html) which
 seems to me to address many of the inefficiencies of data.frame.
 data.tables have no row names; and operations that only need data from
 one or two columns are (I believe) just as quick whether the total
 number of columns is 5 or 1000. This results in very quick operations
 (and, often, elegant code as well).
 
 
 I agree that data.table is a very good alternative (for other reasons) that 
 should be promoted more. The only slight snag is that it doesn't help with 
  the issue at hand since it simply does a pass-through for subassignments to 
 data frame's methods and thus suffers from the same problems (in fact there 
 is a rather stark asymmetry in how it handles subsetting vs subassignment - 
 which is a bit surprising [if I read the code correctly you can't use the 
 same indexing in both]). In fact I would propose that it should not do that 
 but handle the simple cases itself more efficiently without unneeded 
 copies. That would make it indeed a very interesting alternative.
 
 Cheers,
 Simon
 
 
 
 On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:
 thank you, simon.  this was very interesting indeed.  I also now
 understand how far out of my depth I am here.
 
 fortunately, as an end user, obviously, *I* now know how to avoid the
 problem.  I particularly like the as.list() transformation and back to
 as.data.frame() to speed 

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread luke-tierney

On Tue, 5 Jul 2011, Matthew Dowle wrote:


Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?  Small reproducible example in vanilla R 2.13.0 :


x = list(a=1:10000,b=1:10000)
class(x) = "newclass"
`[<-.newclass` = function(x,i,j,value) x  # i.e. do nothing
tracemem(x)

[1] "<0xa1ec758>"

x[1,2] = 42L

tracemem[0xa1ec758 -> 0xa1ec558]:# but, x is still copied, why?




This one is a red herring -- the class(x) <- "newclass" assignment is
bumping up the NAMED value and as a result the following assignment
needs to duplicate. (the primitive class<- could be modified to avoid
the NAMED bump but it's fairly intricate code so I'm not going to look
into it now).
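A hedged way to see the effect described here: trace the object and watch the subassignment that follows the class assignment (exact duplication behaviour depends on the R version):

```r
## Sketch: the class<- assignment bumps NAMED on x, so the next
## subassignment must duplicate; tracemem makes that copy visible.
x <- list(a = 1:10)
tracemem(x)
class(x) <- "newclass"   # bumps the reference count (NAMED) on x
x[["a"]][1] <- 42L       # tracemem typically reports a duplication here
untracemem(x)
```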

[A bit more later in reply to Simon's message]

luke



I've tried returning NULL from [-.newclass but then x gets assigned
NULL :


`[<-.newclass` = function(x,i,j,value) NULL
x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]: 

x

NULL




Any pointers much appreciated. If that copy is preventable it should
save the user needing to use `[<-.data.table`(...) syntax to get the
best speed (20 times faster on the small example used so far).

Matthew


On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon,

Thanks for the great suggestion. I've written a skeleton assignment
function for data.table which incurs no copies, which works for this
case. For completeness, if I understand correctly, this is for :
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340 


After :

 system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows :

 system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131 
 


Still working on it. When it doesn't break other data.table tests, I'll
commit to R-Forge ...

Matthew


On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
 Timothée,
 
 On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
 
  Hi --
  
  It's my first post on this list; as a relatively new user with little

  knowledge of R internals, I am a bit intimidated by the depth of some
  of the discussions here, so please spare me if I say something
  incredibly silly.
  
  I feel that someone at this point should mention Matthew Dowle's

  excellent data.table package
  (http://cran.r-project.org/web/packages/data.table/index.html) which
  seems to me to address many of the inefficiencies of data.frame.
  data.tables have no row names; and operations that only need data from
  one or two columns are (I believe) just as quick whether the total
  number of columns is 5 or 1000. This results in very quick operations
  (and, often, elegant code as well).
  
 
 I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.
 
 Cheers,

 Simon
 
 
  
  On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:

  thank you, simon.  this was very interesting indeed.  I also now
  understand how far out of my depth I am here.
  
  fortunately, as an end user, obviously, *I* now know how to avoid the

  problem.  I particularly like the as.list() transformation and back to
  as.data.frame() to speed things up without loss of (much)
  functionality.
  
  
  more broadly, I view the avoidance of individual access through the

  use of apply and vector operations as a mixed IQ test and knowledge
  test (which I often fail).  However, even for the most clever, there
  are also situations where the KISS programming principle makes
  explicit loops still preferable.  Personally, I would have preferred
  it if R had, in its standard statistical data set data structure,
  foregone the row names feature in exchange for retaining fast direct
  access.  R could have reserved its current implementation with row
  names but slow access for a less common (possibly pseudo-inheriting)
  data structure.
  
  
  If end users commonly do iterations over a data frame, which I would

  guess to be the case, then the impression of R by (novice) end users
  could be greatly enhanced if the extreme penalties could be eliminated
  or at least 

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread luke-tierney

On Tue, 5 Jul 2011, Simon Urbanek wrote:



On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:


Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?


Good point, and conceptually, no. It's a subassignment after all - see R-lang 
3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the 
pass-by-value illusion while passing references in cases where it knows that 
modifications cannot occur and/or they are safe. The default subassign method 
uses that feature which means it can afford to not duplicate if there is only 
one reference -- then it's safe to not duplicate as we are replacing that only 
existing reference. And in the case of a matrix, that will be true at the 
latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the 
dispatch chain so there will always be two references so duplication is 
necessary. Since we have only 0 / 1 / 2+ information on the references, we 
can't distinguish whether the second reference is due to the dispatch or due to 
the passed object having more than one reference, so we have to duplicate in 
any case. That is unfortunate, and I don't see a way around (unless we handle 
subassignment methods in some special way).


I don't believe dispatch is bumping NAMED (and a quick experiment
seems to confirm this though I don't guarantee I did that right). The
issue is that a replacement function implemented as a closure, which
is the only option for a package, will always see NAMED on the object
to be modified as 2 (because the value is obtained by forcing the
argument promise) and so any R level assignments will duplicate.  This
also isn't really an issue of imprecise reference counting -- there
really are (at least) two legitimate references -- one though the
argument and one through the caller's environment.
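The point about closures can be sketched with a user-level replacement function; because `first<-` below is an R closure, forcing its x argument leaves at least two legitimate references, so the assignment inside it duplicates (the function name is illustrative):

```r
## Illustrative replacement closure: it always sees NAMED(x) >= 2,
## one reference through the argument and one through the caller.
`first<-` <- function(x, value) {
  x[1] <- value   # duplicates x before writing, because x is shared
  x
}
v <- c(1L, 2L, 3L)
tracemem(v)
first(v) <- 42L   # desugars to: v <- `first<-`(v, value = 42L)
                  # tracemem typically reports the copy made inside
untracemem(v)
v[1] == 42L       # TRUE: the result is correct, just not in place
```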

It would be good if we could come up with a way for packages to be
able to define replacement functions that do not duplicate in cases
where we really don't want them to, but this would require coming up
with some sort of protocol, minimally involving an efficient way to
detect whether a replacement function is being called in a replacement
context or directly.

There are some replacement functions that use C code to cheat, but
these may create problems if called directly, so I won't advertise
them.

Best,

luke



Cheers,
Simon




 Small reproducible example in vanilla R 2.13.0 :


x = list(a=1:10000,b=1:10000)
class(x) = "newclass"
`[<-.newclass` = function(x,i,j,value) x  # i.e. do nothing
tracemem(x)

[1] "<0xa1ec758>"

x[1,2] = 42L

tracemem[0xa1ec758 -> 0xa1ec558]:# but, x is still copied, why?




I've tried returning NULL from [-.newclass but then x gets assigned
NULL :


`[<-.newclass` = function(x,i,j,value) NULL
x[1,2] = 42L

tracemem[0xa1ec558 -> 0x9c5f318]:

x

NULL




Any pointers much appreciated. If that copy is preventable it should
save the user needing to use `[<-.data.table`(...) syntax to get the
best speed (20 times faster on the small example used so far).

Matthew


On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon,

Thanks for the great suggestion. I've written a skeleton assignment
function for data.table which incurs no copies, which works for this
case. For completeness, if I understand correctly, this is for :
 i) convenience of new users who don't know how to vectorize yet
 ii) more complex examples which can't be vectorized.

Before:


system.time(for (r in 1:R) DT[r,20] <- 1.0)

  user  system elapsed
12.792   0.488  13.340

After :


system.time(for (r in 1:R) DT[r,20] <- 1.0)

  user  system elapsed
 2.908   0.020   2.935

Where this can be reduced further as follows :


system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))

  user  system elapsed
 0.132   0.000   0.131




Still working on it. When it doesn't break other data.table tests, I'll
commit to R-Forge ...

Matthew


On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:


Hi --

It's my first post on this list; as a relatively new user with little
knowledge of R internals, I am a bit intimidated by the depth of some
of the discussions here, so please spare me if I say something
incredibly silly.

I feel that someone at this point should mention Matthew Dowle's
excellent data.table package
(http://cran.r-project.org/web/packages/data.table/index.html) which
seems to me to address many of the inefficiencies of data.frame.
data.tables have no row names; and operations that only need data from
one or two columns are (I believe) just as quick whether the total
number of columns is 5 or 1000. This results in very quick operations
(and, 

Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread David Winsemius


On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu wrote:



On Tue, 5 Jul 2011, Simon Urbanek wrote:



On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:


Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent `[<-` from
copying x?


Good point, and conceptually, no. It's a subassignment after all -
see R-lang 3.4.4 - it is equivalent to


`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to  
keep the pass-by-value illusion while passing references in cases  
where it knows that modifications cannot occur and/or they are  
safe. The default subassign method uses that feature which means it  
can afford to not duplicate if there is only one reference -- then  
it's safe to not duplicate as we are replacing that only existing  
reference. And in the case of a matrix, that will be true at the  
latest from the second subassignment on.
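For example (a sketch in vanilla R with memory profiling enabled, the default in standard builds; whether the first subassignment reports a copy depends on how many references the matrix holds at that point):

```r
# Watch duplications of a matrix across repeated subassignments.
m <- matrix(0, 3, 3)
tracemem(m)     # print a message whenever m is duplicated
m[1, 1] <- 1    # may report one copy if m still has more than one reference
m[2, 2] <- 2    # typically no copy: only one reference remains by now
untracemem(m)
```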


Unfortunately the method dispatch (AFAICS) introduces one more  
reference in the dispatch chain so there will always be two  
references so duplication is necessary. Since we have only 0 / 1 /  
2+ information on the references, we can't distinguish whether the  
second reference is due to the dispatch or due to the passed object  
having more than one reference, so we have to duplicate in any  
case. That is unfortunate, and I don't see a way around it (unless we  
handle subassignment methods in some special way).


I don't believe dispatch is bumping NAMED (and a quick experiment
seems to confirm this though I don't guarantee I did that right). The
issue is that a replacement function implemented as a closure, which
is the only option for a package, will always see NAMED on the object
to be modified as 2 (because the value is obtained by forcing the
argument promise) and so any R level assignments will duplicate.  This
also isn't really an issue of imprecise reference counting -- there
really are (at least) two legitimate references -- one through the
argument and one through the caller's environment.
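The effect can be sketched like this (`cap<-` is a made-up replacement closure, standing in for any package-defined replacement function; vanilla R with memory profiling assumed):

```r
# A replacement function defined as a closure, as any package must do.
`cap<-` <- function(x, value) {
  # x arrives via a forced argument promise, so it already carries a
  # second reference here; the assignment below therefore duplicates x.
  x[x > value] <- value
  x
}
y <- c(1, 5, 10)
tracemem(y)    # print a message whenever y is duplicated
cap(y) <- 4    # tracemem reports the copy made inside the closure
```

Called directly, `` `cap<-`(y, 4) `` just returns the modified copy and leaves `y` untouched, which is why such a function cannot in general tell a replacement context from an ordinary call.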

It would be good if we could come up with a way for packages to be
able to define replacement functions that do not duplicate in cases
where we really don't want them to, but this would require coming up
with some sort of protocol, minimally involving an efficient way to
detect whether a replacement function is being called in a replacement
context or directly.


Would `$<-` always satisfy that condition? It would be a big help to me  
if it could be designed to avoid duplicating the rest of the data.frame.


--



There are some replacement functions that use C code to cheat, but
these may create problems if called directly, so I won't advertise
them.

Best,

luke



Cheers,
Simon





--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: l...@stat.uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


David Winsemius, MD
West Hartford, CT



Re: [Rd] [datatable-help] speeding up perception

2011-07-05 Thread Simon Urbanek
No subassignment function satisfies that condition, because you can always call 
it directly. However, that doesn't stop the default method from making that 
assumption, so I'm not sure it's an issue.

David, just to clarify - the data frame content is not copied; we are talking 
about the vector holding the columns.

Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:

 
 [ snip -- quoted text of the preceding messages ]
 
 
