Re: [Rd] [datatable-help] speeding up perception
Matthew, I was hoping I misunderstood your first proposal, but I suspect I did not ;). Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to { V1 <- 3; DT[1, V1] } thus returning the first element of the third column.

Please see FAQ 1.1, since further below it seems to be an expectation issue about 'with' syntax, too.

That said, I don't think it works, either. Taking your example and data.table from R-Forge: [ snip ] as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce the non-working state I'll need some more environment information please.

Also I suspect there is something quite amiss because even trivial things don't work:

DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
DT[1:4,1:4]
[1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9 and 1.10.

When I first saw your proposal, I thought you rather had something like within(DT, V1[1] <- 3) in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of building 'within' in, too. I'll take a look at within(). Might as well provide as many options as possible to the user to use as they wish.

With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) performs reasonably:

system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the global environment - it also interacts with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the original issue Ivo raised, then? If so, job done.
Re: [Rd] [datatable-help] speeding up perception
Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge. From NEWS :

o Fast update is now implemented, FR#200. DT[i,j]<-value is now handled by data.table in C rather than falling through to data.frame methods. Thanks to Ivo Welch for raising speed issues on r-devel, to Simon Urbanek for the suggestion, and Luke Tierney and Simon for information on R internals. [<- syntax still incurs one working copy of the whole table (as of R 2.13.0) due to R's [<- dispatch mechanism copying to `*tmp*`, so, for ultimate speed and brevity, 'within' syntax is now available as follows.

o A new 'within' argument has been added to [.data.table, by default TRUE. It is very similar to the within() function in base R. If an assignment appears in j, it assigns to the column of DT, by reference; e.g., DT[i,colname<-value]. This syntax makes no copies of any part of memory at all.

m = matrix(1,nrow=10,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed
287.730 323.196 613.453
system.time(for (i in 1:1000) DT[1,V1 <- 3])
   user  system elapsed
  1.152   0.004   1.161  # 528 times faster

Please note :
***
** Within syntax is presently highly experimental. **
***

http://datatable.r-forge.r-project.org/

On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
On Wed, 6 Jul 2011, Simon Urbanek wrote:

Interesting, and I stand corrected:

x = data.frame(a=1:n,b=1:n)
.Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[1,1]=42L
.Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
Re: [Rd] [datatable-help] speeding up perception
Simon, If you didn't install.packages() with method=source from R-Forge, that would explain (some of) it. R-Forge builds binaries once each night. This commit was long after the cutoff.

Matthew
Re: [Rd] [datatable-help] speeding up perception
On Jul 12, 2011, at 6:24 AM, Matthew Dowle wrote:

Matthew, I was hoping I misunderstood your first proposal, but I suspect I did not ;). Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to { V1 <- 3; DT[1, V1] } thus returning the first element of the third column.

Please see FAQ 1.1, since further below it seems to be an expectation issue about 'with' syntax, too.

Just to clarify - the NEWS has led me to believe that the destructive DT[i, x <- y] syntax is new. That is what my objection is about. I'm fine with subsetting operators working on expressions but I'm not happy with subsetting operators modifying the object they are subsetting - since it's subsetting not subassignment - that's what I was referring to.

That said, I don't think it works, either. Taking your example and data.table from R-Forge: [ snip ] as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce the non-working state I'll need some more environment information please.
The issue persists on several machines I tested - including R 2.13.0:

sessionInfo()
R version 2.13.0 Patched (2011-05-15 r55914)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.6.3

sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu/amd64 (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.6.3

DT = as.data.table(m)
for (i in 1:1000) DT[1,V1 <- 3]
DT[1,]
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

Also I suspect there is something quite amiss because even trivial things don't work:

DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
DT[1:4,1:4]
[1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9 and 1.10.

Fair enough, I expected data.table to be a drop-in replacement of data.frames - I just wanted to check the values. Apparently it's not, by design, hence my assumption was wrong.

When I first saw your proposal, I thought you rather had something like within(DT, V1[1] <- 3) in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of building 'within' in, too. I'll take a look at within(). Might as well provide as many options as possible to the user to use as they wish.
With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) performs reasonably:

system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the global environment - it also interacts with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the original issue Ivo raised, then? If so, job done.

I don't think so - at least not in the strict sense of no copies (more digging may be needed, though, since it does so in system.time, possibly due to the NAMED value of the forced promise but I did not check). However, it allows the modification to be expressed inside the expression, which will save the global copy and thus be faster than the outside loop.

Cheers, Simon

Cheers, Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge. From NEWS :

o Fast update is now implemented, FR#200. DT[i,j]<-value is now handled by data.table in C rather than falling through to data.frame methods. Thanks to Ivo Welch for raising speed issues on r-devel, to Simon Urbanek for the suggestion, and Luke Tierney and Simon for information on R internals. [<- syntax still incurs one working copy of the whole table (as of R 2.13.0) due to R's [<- dispatch mechanism copying to `*tmp*`, so, for ultimate speed and brevity, 'within' syntax is now
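Simon's note about within() can be checked with a few lines of base R. This is only a small sketch on a hypothetical three-row data frame, not the benchmark from the thread: it shows that within() returns a modified copy and leaves the original untouched unless the result is assigned back, and that a loop counter used inside the expression should be removed so it does not end up as a column.

```r
# Hypothetical toy data frame; column names mirror the thread's V1, V2.
DF <- data.frame(V1 = c(1, 1, 1), V2 = c(1, 1, 1))

# The result is kept in a new object: DF itself is not modified.
res <- within(DF, V1[1] <- 3)

# A loop inside the expression. rm(i) keeps the loop counter from
# being written back into the result as an extra column.
res_all <- within(DF, {
  for (i in seq_along(V1)) V1[i] <- 3
  rm(i)
})
```

To update DF itself the result has to be assigned back, as in DF <- within(DF, ...), which is where the copy discussed in the thread comes in.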
Re: [Rd] [datatable-help] speeding up perception
Matthew, I was hoping I misunderstood your first proposal, but I suspect I did not ;). Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to { V1 <- 3; DT[1, V1] } thus returning the first element of the third column. I do understand that within(foo, expr, ...) was the motivation for passing expressions, but unlike within() the subsetting operator [ is not expected to take an expression as its second argument. Such abuse is quite unexpected and I would say dangerous.

That said, I don't think it works, either. Taking your example and data.table from R-Forge:

m = matrix(1,nrow=10,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
for (i in 1:1000) DT[1,V1 <- 3]
DT[1,]
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

as you can see, DT is not modified.

Also I suspect there is something quite amiss because even trivial things don't work:

DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
DT[1:4,1:4]
[1] 1 2 3 4

When I first saw your proposal, I thought you rather had something like within(DT, V1[1] <- 3) in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) performs reasonably:

system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the global environment - it also interacts with the possible duplication)

Cheers, Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge.
Re: [Rd] [datatable-help] speeding up perception
On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?

n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
  0.628   0.000   0.628
n = 10
x = data.frame(a=1:n,b=1:n) # still 2 columns, but longer columns
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
 20.145   1.232  21.455

With $<- :

n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
  0.304   0.000   0.307
n = 10
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
 37.586   0.388  38.161

If it's because the 1st column needs to be copied (only) because that's the one being assigned to (in this test), that magnitude of slow down doesn't seem consistent with the time of a vector copy of the 1st column :

n=10
v = 1:n
system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed
  0.016   0.000   0.017
system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
   user  system elapsed
  1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is assigned to :

n=10
x = data.frame(rep(list(1:n),100))
dim(x)
[1] 10100
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
167.974  50.903 219.711

Cheers, Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:
On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu luke-tier...@uiowa.edu wrote:
On Tue, 5 Jul 2011, Simon Urbanek wrote:
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this though I don't guarantee I did that right). The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise) and so any R level assignments will duplicate. This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment.
It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

Would $<- always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.

-- There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best, luke

Cheers,
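The two ways a replacement function can be reached, which Luke and Simon refer to above, can be sketched in base R. The function name `first<-` is hypothetical and chosen only for illustration; the expansion shown in the comment follows the `*tmp*` form quoted from R-lang 3.4.4. The sketch says nothing about whether R duplicates internally, only about the semantics a caller can observe.

```r
# A hypothetical replacement function written as an ordinary closure,
# which is the only form available to package code.
`first<-` <- function(x, value) {
  x[1] <- value   # R-level assignment inside the closure
  x               # a replacement function returns the modified object
}

v <- c(1L, 2L, 3L)

# Replacement context. Conceptually this is evaluated as
#   `*tmp*` <- v; v <- `first<-`(`*tmp*`, value = 42L); rm(`*tmp*`)
first(v) <- 42L

# Direct call. Nothing prevents this, and here v must stay untouched,
# so the function cannot simply assume it may modify x in place.
w <- `first<-`(v, value = 7L)
```

After these lines v holds the value assigned through the replacement form, while the direct call produced a separate object w and left v as it was.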
Re: [Rd] [datatable-help] speeding up perception
Interesting, and I stand corrected:

x = data.frame(a=1:n,b=1:n)
.Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[1,1]=42L
.Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes so at some point ...

Cheers, Simon

On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?
Re: [Rd] [datatable-help] speeding up perception
On Wed, 6 Jul 2011, Simon Urbanek wrote:

Interesting, and I stand corrected:

    x = data.frame(a=1:n,b=1:n)
    .Internal(inspect(x))
    @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
      @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
      @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
    x[1,1]=42L
    .Internal(inspect(x))
    @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
      @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
      @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
    x[[1]][1]=42L
    .Internal(inspect(x))
    @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
      @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
      @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
    x[[1]][1]=42L
    .Internal(inspect(x))
    @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
      @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
      @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes so at some point ...

The internals assume in various places that deep copies are made (one of the reasons NAMED settings are not propagated to sub-structure). The main issues are avoiding cycles and that there is no easy way to check for sharing. There may be some circumstances in which a shallow copy would be OK but making sure it would be in all cases is probably more trouble than it is worth at this point. (I've tried this in the past in a few cases and always had to back off.)

Best, luke

Cheers, Simon

On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.
David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?

    n = 1000
    x = data.frame(a=1:n,b=1:n)
    system.time(for (i in 1:1000) x[1,1] <- 42L)
       user  system elapsed
      0.628   0.000   0.628
    n = 10
    x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
    system.time(for (i in 1:1000) x[1,1] <- 42L)
       user  system elapsed
     20.145   1.232  21.455

With $<- :

    n = 1000
    x = data.frame(a=1:n,b=1:n)
    system.time(for (i in 1:1000) x$a[1] <- 42L)
       user  system elapsed
      0.304   0.000   0.307
    n = 10
    x = data.frame(a=1:n,b=1:n)
    system.time(for (i in 1:1000) x$a[1] <- 42L)
       user  system elapsed
     37.586   0.388  38.161

If it's because the 1st column needs to be copied (only) because that's the one being assigned to (in this test), that magnitude of slow down doesn't seem consistent with the time of a vector copy of the 1st column :

    n=10
    v = 1:n
    system.time(for (i in 1:1000) v[1] <- 42L)
       user  system elapsed
      0.016   0.000   0.017
    system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
       user  system elapsed
      1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is assigned to :

    n=10
    x = data.frame(rep(list(1:n),100))
    dim(x)
    [1] 10100
    system.time(for (i in 1:1000) x[1,1] <- 42L)
       user  system elapsed
    167.974  50.903 219.711

Cheers, Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:
On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu wrote:
On Tue, 5 Jul 2011, Simon Urbanek wrote:
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no.
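The three assignment forms discussed in this thread can be compared with a small sketch like the one below. The sizes `n` and `reps` are illustrative assumptions, not the values used in the original test, and the timings depend heavily on the R version, so none are shown.

```r
# Illustrative sizes only (assumed); the thread's own values are not reproduced.
n <- 1000L
reps <- 100L

x1 <- data.frame(a = 1:n, b = 1:n)
x2 <- data.frame(a = 1:n, b = 1:n)
x3 <- data.frame(a = 1:n, b = 1:n)

# Matrix-style indexing, dispatching to `[<-.data.frame`.
t_bracket <- system.time(for (i in seq_len(reps)) x1[1, 1] <- 42L)
# Column by name, then element: `$<-.data.frame` wrapped around a vector `[<-`.
t_dollar <- system.time(for (i in seq_len(reps)) x2$a[1] <- 42L)
# Column by position, then element: `[[<-.data.frame` wrapped around a vector `[<-`.
t_double <- system.time(for (i in seq_len(reps)) x3[[1]][1] <- 42L)

# All three forms should leave the first column in the same state.
same_columns <- identical(x1$a, x2$a) && identical(x2$a, x3$a)

print(rbind(bracket = t_bracket, dollar = t_dollar, double = t_double)[, 1:3])
```

Rerunning with a larger `n` while keeping `reps` fixed is one way to check whether the cost per assignment grows with column length on a given R version.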
Re: [Rd] [datatable-help] speeding up perception
Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
     12.792   0.488  13.340

After :

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
      2.908   0.020   2.935

Where this can be reduced further as follows :

    system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
       user  system elapsed
      0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ...

Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well).

I agree that data.table is a very good alternative (for other reasons) that should be promoted more.
The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.

Cheers, Simon

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:

thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality.

more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure.

If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to.
Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an individual read/write into a data frame ("Warning: data frames are much slower than lists of vectors for individual element access").

I would also suggest changing the Introduction to R 6.3 from

  "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions."

to

  "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions. However, data frames can be much slower than matrices or even lists of vectors (which, like data frames, can contain different types of columns) when individual elements need to be accessed."

Reading about it immediately upon introduction could flag the problem in a more visible manner.

regards, /iaw
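The as.list() round trip mentioned above might look like the following. This is a minimal sketch under assumed sizes and an assumed loop body; it shows the pattern only and makes no claim about the speedup on any particular R version.

```r
# Assumed example data; the sizes are illustrative only.
df <- data.frame(a = 1:1000, b = 1:1000)

# Element-wise assignment directly into the data frame.
slow <- df
for (i in 1:100) slow[i, 1] <- 0L

# Workaround: drop to a plain list of vectors, loop, then rebuild the data frame.
cols <- as.list(df)
for (i in 1:100) cols$a[i] <- 0L
fast <- as.data.frame(cols)

# Both routes should produce the same columns.
same_result <- identical(slow$a, fast$a) && identical(slow$b, fast$b)
```

The loop over `cols` goes through the default list and vector assignment code rather than the data.frame methods, which is the point of the workaround.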
Re: [Rd] [datatable-help] speeding up perception
Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Small reproducible example in vanilla R 2.13.0 :

    x = list(a=1:1,b=1:1)
    class(x) = "newclass"
    "[<-.newclass" = function(x,i,j,value) x  # i.e. do nothing
    tracemem(x)
    [1] "<0xa1ec758>"
    x[1,2] = 42L
    tracemem[0xa1ec758 -> 0xa1ec558]:  # but, x is still copied, why?

I've tried returning NULL from [<-.newclass but then x gets assigned NULL :

    "[<-.newclass" = function(x,i,j,value) NULL
    x[1,2] = 42L
    tracemem[0xa1ec558 -> 0x9c5f318]:
    x
    NULL

Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far).

Matthew

On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
     12.792   0.488  13.340

After :

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
      2.908   0.020   2.935

Where this can be reduced further as follows :

    system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
       user  system elapsed
      0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ...
Re: [Rd] [datatable-help] speeding up perception
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

    `*tmp*` <- x
    x <- `[<-`(`*tmp*`, i, j, value)
    rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

Cheers, Simon

Small reproducible example in vanilla R 2.13.0 :

    x = list(a=1:1,b=1:1)
    class(x) = "newclass"
    "[<-.newclass" = function(x,i,j,value) x  # i.e. do nothing
    tracemem(x)
    [1] "<0xa1ec758>"
    x[1,2] = 42L
    tracemem[0xa1ec758 -> 0xa1ec558]:  # but, x is still copied, why?

I've tried returning NULL from [<-.newclass but then x gets assigned NULL :

    "[<-.newclass" = function(x,i,j,value) NULL
    x[1,2] = 42L
    tracemem[0xa1ec558 -> 0x9c5f318]:
    x
    NULL

Any pointers much appreciated.
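The expansion Simon cites from the R language definition can be written out by hand. The sketch below uses an assumed small integer vector and checks that the manual three-step form gives the same result as ordinary subassignment.

```r
# Two bindings to the same starting vector (illustrative data).
x <- c(a = 1L, b = 2L, c = 3L)
y <- x

# Ordinary subassignment.
x[2] <- 10L

# The same step spelled out as the conceptual expansion.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 2, value = 10L)
rm(`*tmp*`)

# Both routes should end with the same object.
expansion_matches <- identical(x, y)
```

Calling tracemem() on the object before each form is one way to see on a given R build whether a physical duplicate was made. Its output is version dependent, so it is left out here.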
Re: [Rd] [datatable-help] speeding up perception
On Tue, 5 Jul 2011, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Small reproducible example in vanilla R 2.13.0 :

    x = list(a=1:1,b=1:1)
    class(x) = "newclass"
    "[<-.newclass" = function(x,i,j,value) x  # i.e. do nothing
    tracemem(x)
    [1] "<0xa1ec758>"
    x[1,2] = 42L
    tracemem[0xa1ec758 -> 0xa1ec558]:  # but, x is still copied, why?

This one is a red herring -- the class(x) <- "newclass" assignment is bumping up the NAMED value and as a result the following assignment needs to duplicate. (The primitive class<- could be modified to avoid the NAMED bump but it's fairly intricate code so I'm not going to look into it now.)

[A bit more later in reply to Simon's message]

luke

I've tried returning NULL from [<-.newclass but then x gets assigned NULL :

    "[<-.newclass" = function(x,i,j,value) NULL
    x[1,2] = 42L
    tracemem[0xa1ec558 -> 0x9c5f318]:
    x
    NULL

Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far).

Matthew

On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.

Before:

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
     12.792   0.488  13.340

After :

    system.time(for (r in 1:R) DT[r,20] <- 1.0)
       user  system elapsed
      2.908   0.020   2.935

Where this can be reduced further as follows :

    system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
       user  system elapsed
      0.132   0.000   0.131

Still working on it.
Re: [Rd] [datatable-help] speeding up perception
On Tue, 5 Jul 2011, Simon Urbanek wrote:
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

    `*tmp*` <- x
    x <- `[<-`(`*tmp*`, i, j, value)
    rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this though I don't guarantee I did that right). The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise) and so any R level assignments will duplicate.
This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment.

It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best, luke

Cheers, Simon

Small reproducible example in vanilla R 2.13.0 :

    x = list(a=1:1,b=1:1)
    class(x) = "newclass"
    "[<-.newclass" = function(x,i,j,value) x  # i.e. do nothing
    tracemem(x)
    [1] "<0xa1ec758>"
    x[1,2] = 42L
    tracemem[0xa1ec758 -> 0xa1ec558]:  # but, x is still copied, why?

I've tried returning NULL from [<-.newclass but then x gets assigned NULL :

    "[<-.newclass" = function(x,i,j,value) NULL
    x[1,2] = 42L
    tracemem[0xa1ec558 -> 0x9c5f318]:
    x
    NULL

Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far).

Matthew

On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :
  i) convenience of new users who don't know how to vectorize yet
  ii) more complex examples which can't be vectorized.
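The two calling contexts Luke distinguishes, replacement context versus a direct call, can be illustrated with a toy S3 replacement method written as an ordinary closure. The class name `newclass` follows Matthew's example; the method body is an assumption made for illustration and is not data.table's implementation.

```r
# Toy object of the illustrative class "newclass".
x <- structure(list(a = 1:3, b = 4:6), class = "newclass")

# Hypothetical replacement method: a closure that edits a local copy
# of the columns and returns a new object.
"[<-.newclass" <- function(x, i, j, value) {
  cols <- unclass(x)
  cols[[j]][i] <- value
  structure(cols, class = "newclass")
}

# Replacement context: R rebinds x to whatever the method returns.
x[1, 2] <- 42L

# Direct call: the result is returned and the caller's y is left untouched.
y <- structure(list(a = 1:3, b = 4:6), class = "newclass")
z <- `[<-.newclass`(y, 1, 2, 42L)
```

Because the method cannot tell which of the two contexts it is running in, it cannot safely modify its argument in place at the R level, which is the protocol gap described above.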
Re: [Rd] [datatable-help] speeding up perception
On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu wrote:
On Tue, 5 Jul 2011, Simon Urbanek wrote:
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

    `*tmp*` <- x
    x <- `[<-`(`*tmp*`, i, j, value)
    rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.

Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this though I don't guarantee I did that right).
The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise) and so any R level assignments will duplicate. This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment.

It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

Would $<- always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.

--

There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best, luke

Cheers, Simon

--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: l...@stat.uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu

David Winsemius, MD
West Hartford, CT

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David,
Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

Cheers,
Simon

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:

> Would $<- always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.
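The point that any subassignment function can be called directly can be illustrated with a small replacement function written as an ordinary closure. This is a hedged sketch: `second<-` is a hypothetical function invented for this example, not something from data.table or base R, and the code says nothing about how many internal copies R makes.

```r
# Hypothetical replacement function, defined as a plain closure
# (the only kind a package can define at R level, per the thread).
`second<-` <- function(x, value) {
  x[2] <- value   # R-level assignment inside the closure
  x               # a replacement function returns the modified object
}

# Replacement context: R rebinds `a` to the returned value.
a <- c(1, 2, 3)
second(a) <- 20

# Direct call: nothing prevents this, and `b` itself must stay untouched.
b <- c(1, 2, 3)
b_new <- `second<-`(b, value = 20)

print(identical(a, b_new))       # same result either way
print(identical(b, c(1, 2, 3)))  # the direct call did not alter b
```

Because the direct call must leave `b` intact, a function that modified its argument in place would only be safe when it could tell the two calling contexts apart, which is the protocol problem raised earlier in the thread.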