Re: [Rd] Circumventing code/documentation mismatches ('R CMD check')
On Jul 5, 2011, at 08:00, Johannes Graumann wrote:

As prompted by B. Ripley (see below), I am transferring this over from R-User ...

For a package I am writing a function that looks like

test <- function(Argument1=NA){
  # Prerequisite testing
  if(!(is.na(Argument1))){
    if(!(is.character(Argument1))){
      stop("Wrong class.")
    }
  }
  # Function Body
  cat("Hello World\n")
}

Documentation of this is straightforward:

... \usage{test(Argument1=NA)} ...

However, writing the function could be made more concise like so:

test2 <- function(Argument1=NA_character_){
  # Prerequisite testing
  if(!(is.character(Argument1))){
    stop("Wrong class.")
  }
  # Function Body
  cat("Hello World\n")
}

To prevent confusion I do not want to use 'NA_character_' in the user-exposed documentation, and using

... \usage{test2(Argument1=NA)} ...

leads to a warning regarding a code/documentation mismatch. Is there any way to prevent that?

You don't want to do that... That strategy breaks if someone passes the documented default explicitly, which certainly _causes_ confusion rather than prevents it. I.e.

test2(NA) # fails
test3 <- function(a=NA) test2(a) # 3rd party code might build on your function
test3() # fails

If your function only accepts character values, even if NA, then that is what should be documented. In the end, you'll find that an explicit is.na() is the right thing to do.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
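[Editor's note: Peter's failure mode is easy to reproduce at the console. A minimal sketch, using the function names from the thread:]

```r
# Concise but fragile: default is NA_character_, but a user passing the
# documented plain NA (which is logical) triggers the class check.
test2 <- function(Argument1 = NA_character_) {
  if (!is.character(Argument1)) stop("Wrong class.")
  cat("Hello World\n")
}
test2()          # works: NA_character_ is itself a character vector
try(test2(NA))   # fails: plain NA is logical, not character

# Robust version: document and accept the plain NA default, and
# guard the class check with an explicit is.na() test.
test <- function(Argument1 = NA) {
  if (!is.na(Argument1) && !is.character(Argument1)) stop("Wrong class.")
  cat("Hello World\n")
}
test(NA)         # works, matching the documented \usage
```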
Re: [Rd] [datatable-help] speeding up perception
Simon,

Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for:
i) convenience of new users who don't know how to vectorize yet
ii) more complex examples which can't be vectorized.

Before:
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After:
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

This can be reduced further as follows:
system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ...

Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well).

I agree that data.table is a very good alternative (for other reasons) that should be promoted more.
The only slight snag is that it doesn't help with the issue at hand, since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that, but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.

Cheers,
Simon

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:

thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality.

more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation, with row names but slow access, for a less common (possibly pseudo-inheriting) data structure.

If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to.
Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an individual read/write into a data frame ("Warning: data frames are much slower than lists of vectors for individual element access").

I would also suggest changing "An Introduction to R", section 6.3, from

"A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions."

to

"A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions. However, data frames can be much slower than matrices or even lists of vectors (which, like data frames, can contain different types of columns) when individual elements need to be accessed."

Reading about it immediately upon introduction could flag the problem in a more visible manner.

regards, /iaw
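[Editor's note: the as.list()/as.data.frame() transformation ivo mentions can be sketched as follows; timings will vary by machine and R version, so none are quoted here:]

```r
n  <- 1e4
df <- data.frame(x = numeric(n), y = numeric(n))

# Slow: each element-wise subassignment into a data frame goes through
# the (copying) data-frame methods discussed in this thread.
system.time(for (i in 1:n) df[i, 1] <- i)

# Faster: convert to a plain list of vectors, loop over the vector
# directly, then convert back once at the end.
l <- as.list(df)
system.time(for (i in 1:n) l$x[i] <- i)
df2 <- as.data.frame(l)
```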
Re: [Rd] Recent and upcoming changes to R-devel
L.S.

On 07/05/2011 02:16 AM, mark.braving...@csiro.au wrote:

I may have misunderstood, but: Please could we have an optional installation that does *not* byte-compile base and recommended?

Reason: it's not possible to debug byte-compiled code -- at least not with the 'debug' package, which is quite widely used. I quite often end up using 'mtrace' on functions in base/recommended packages to figure out what they are doing. And sometimes I (and others) experiment with changing functions in base/recommended to improve functionality. That seems to be harder with BC versions, and might even be impossible, as best I can tell from hints in the documentation of 'compile'.

Personally, if I had to choose only one, I'd rather live with the speed penalty from not byte-compiling. But of course, if both are available, I could install both.

I completely second this request.

All speed improvements, and the byte compiler in particular, are leaps forward, and I am very grateful and admiring towards the people that make this happen. That being said, 'moving away' from the sources (with the lazy loading files and byte-compilation) may be a step back for R package developers that (during development, and maybe on separate development installations [as opposed to production installations of R]) require the sources of all packages to be efficient in their work.

As many of you know, there is an open source Eclipse/StatET visual debugger ready, and for that application as well (similar to Mark's request) the presence of non-compiled code is highly desirable. For the particular purpose of debugging R packages, I would even plead to go beyond the current options and support the addition of an R package install option that allows one to include the sources (e.g. in a standard folder Rsrc/) in installed packages.
I am fully aware that one can always fetch the source tarballs from CRAN for that purpose, but it would be much easier if a simple installation option could put the R sources of a package in a separate folder [or archive inside an existing folder] such that R development tools (such as the Eclipse/StatET IDE) can offer inspection of sources or display them (e.g. during debugging) out of the box. If one has the srcref, one can always load the absolutely correct source code this way, even if one doesn't know the parent function with the source attribute.

Any comments?

Best,
Tobias

P.S. One could even consider a post-install option, e.g. to add 'real' R sources (and source references) to Windows packages (which are by definition already 'installed' and for which such information is not by default included in the CRAN binaries of these packages).

Prof Brian Ripley wrote:

There was an R-core meeting the week before last, and various planned changes will appear in R-devel over the next few weeks. These are changes planned for R 2.14.0, scheduled for Oct 31. As we are sick of people referring to R-devel as '2.14' or '2.14.0', that version number will not be used until we reach 2.14.0 alpha. You will be able to have a package depend on an svn version number when referring to R-devel rather than using R (>= 2.14.0).

All packages are installed with lazy-loading (there were 72 CRAN packages and 8 BioC packages which opted out). This means that the code is always parsed at install time, which inter alia simplifies the descriptions. R 2.13.1 RC warns on installation about packages which ask not to be lazy-loaded, and R-devel ignores such requests (with a warning).

In the near future all packages will have a name space. If the sources do not contain one, a default NAMESPACE file will be added. This again will simplify the descriptions and also a lot of internal code. Maintainers of packages without name spaces (currently 42% of CRAN) are encouraged to add one themselves.
R-devel is installed with the base and recommended packages byte-compiled (the equivalent of 'make bytecode' in R 2.13.x, but done less inefficiently). There is a new option, R CMD INSTALL --byte-compile, to byte-compile contributed packages, but that remains optional. Byte-compilation is quite expensive (so you definitely want to do it at install time, which requires lazy-loading), and relatively few packages benefit appreciably from byte-compilation. A larger number of packages benefit from byte-compilation of R itself: for example, AER runs its checks 10% faster. The byte-compiler technology is thanks to Luke Tierney.

There is support for figures in Rd files: currently with a first-pass implementation (thanks to Duncan Murdoch).

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
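[Editor's note: the byte-compiler Luke Tierney contributed has been available to users since the compiler package was added to R; a sketch of byte-compiling a single function by hand:]

```r
library(compiler)

# A plain interpreted function: an explicit loop, which is the kind of
# code that tends to benefit most from byte-compilation.
f <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]
  s
}

fc <- cmpfun(f)  # byte-compiled version of f
fc(1:10)         # 55, same result as f(1:10)
fc               # printing fc shows an extra <bytecode: ...> line
```

What R CMD INSTALL --byte-compile does is, roughly, apply this compilation to every function in a package at install time, so the cost is paid once rather than at each load.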
Re: [Rd] Recent and upcoming changes to R-devel
On 05/07/2011 6:52 AM, Tobias Verbeke wrote:

[...]

For the particular purpose of debugging R packages, I would even plead to go beyond the current options and support the addition of an R package install option that allows one to include the sources (e.g. in a standard folder Rsrc/) in installed packages. [...] Any comments?

I think these requests have already been met. If you modify the body of a closure (as trace() does), then the byte-compiled version is discarded, and you go back to the regular interpreted code.

If you install packages with the R_KEEP_PKG_SOURCE=yes environment variable set, then you keep all source for all functions. (It's attached to the function itself, not as a file that may be out of date.)

It's possible that byte compiling turns off R_KEEP_PKG_SOURCE, but that is something that is either easily fixed, or avoided by re-installing without byte compiling.

Duncan Murdoch
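[Editor's note: the environment variable Duncan mentions is set when the package is installed, e.g. from the command line; the package name below is hypothetical:]

```shell
# Keep full R sources (with srcrefs) attached to the installed functions;
# the variable can also be set in an Renviron file to apply to all installs.
R_KEEP_PKG_SOURCE=yes R CMD INSTALL somepkg_1.0.tar.gz
```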
Re: [Rd] Recent and upcoming changes to R-devel
Dear Duncan,

On 07/05/2011 03:25 PM, Duncan Murdoch wrote:

[...]

I think these requests have already been met. If you modify the body of a closure (as trace() does), then the byte-compiled version is discarded, and you go back to the regular interpreted code. If you install packages with the R_KEEP_PKG_SOURCE=yes environment variable set, then you keep all source for all functions. (It's attached to the function itself, not as a file that may be out of date.) It's possible that byte compiling turns off R_KEEP_PKG_SOURCE, but that is something that is either easily fixed, or avoided by re-installing without byte compiling.

Many thanks for your reaction. Is the R_KEEP_PKG_SOURCE=yes environment variable also supported during R installation?

I hope I'm not overlooking anything, but when compiling ftp://ftp.stat.math.ethz.ch/Software/R/R-devel.tar.gz a few minutes ago I encountered the following issue:

[...]
building package 'tools'
mkdir -p -- ../../../library/tools
make[4]: Entering directory `/home/tobias/rAdmin/R-devel/src/library/tools'
mkdir -p -- ../../../library/tools/R
mkdir -p -- ../../../library/tools/po
make[4]: Leaving directory `/home/tobias/rAdmin/R-devel/src/library/tools'
make[4]: Entering directory `/home/tobias/rAdmin/R-devel/src/library/tools'
make[5]: Entering directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
making text.d from text.c
making init.d from init.c
making Rmd5.d from Rmd5.c
making md5.d from md5.c
gcc -std=gnu99 -I../../../../include -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c text.c -o text.o
gcc -std=gnu99 -I../../../../include -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c init.c -o init.o
gcc -std=gnu99 -I../../../../include -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c Rmd5.c -o Rmd5.o
gcc -std=gnu99 -I../../../../include -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c md5.c -o md5.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o tools.so text.o init.o Rmd5.o md5.o -L../../../../lib -lR
make[6]: Entering directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
make[6]: `Makedeps' is up to date.
make[6]: Leaving directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
make[6]: Entering directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
mkdir -p -- ../../../../library/tools/libs
make[6]: Leaving directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
make[5]: Leaving directory `/home/tobias/rAdmin/R-devel/src/library/tools/src'
make[4]: Leaving directory
Re: [Rd] Recent and upcoming changes to R-devel
On 05/07/2011 10:17 AM, Tobias Verbeke wrote:

[...]

Many thanks for your reaction. Is the R_KEEP_PKG_SOURCE=yes environment variable also supported during R installation?

Yes, other than the error you saw below, which is a temporary problem. Not sure which function exceeded the length limit, but the length limit is going away before 2.14.0 is released.

Duncan Murdoch

I hope I'm not overlooking anything, but when compiling ftp://ftp.stat.math.ethz.ch/Software/R/R-devel.tar.gz a few minutes ago I encountered the following issue:

[...]
Re: [Rd] Recent and upcoming changes to R-devel
On 05/07/2011 11:20 AM, Stephan Wahlbrink wrote:

Dear developers,

Duncan Murdoch wrote [2011-07-05 15:25]:

[...]

I don't know how the new installation works exactly, but would it be possible to simply install both types, the old expression bodies and the new byte-compiled, as a single package at the same time?

Yes, that's what is done.

This would allow the R user and developer to simply use the variant which is the best at the moment. If he wants to debug code, he can switch off the use of byte-compiled code and use the old R expressions (with attached srcrefs). If debugging is not required, he can profit from the byte-compiled version. The best would be a toggle, to switch it at runtime, but a startup option would be sufficient too. I think direct access to the code is one big advantage of open source software. For developers, it makes it easier to find and fix bugs if something is wrong. But it can also help users a lot to understand how a function or algorithm works and learn from code written by other persons - if the access to the sources is easy.

As long as byte-code doesn't support the debugging features of R, it is required for best debugging support to run the functions completely without byte-compiled code. If I understood it correctly, byte-code frames would disable srcrefs as well as features like "step return" to those frames. Therefore I ask for a way to easily switch between both execution types.

What gave you that impression?

Duncan Murdoch

Best,
Stephan
Re: [Rd] Recent and upcoming changes to R-devel
Dear developers, Duncan Murdoch wrote [2011-07-05 15:25]: On 05/07/2011 6:52 AM, Tobias Verbeke wrote: L.S. On 07/05/2011 02:16 AM, mark.braving...@csiro.au wrote: I may have misunderstood, but: Please could we have an optional installation that does not*not* byte-compile base and recommended? Reason: it's not possible to debug byte-compiled code-- at least not with the 'debug' package, which is quite widely used. I quite often end up using 'mtrace' on functions in base/recommended packages to figure out what they are doing. And sometimes I (and others) experiment with changing functions in base/recommended to improve functionality. That seems to be harder with BC versions, and might even be impossible, as best I can tell from hints in the documentation of 'compile'). Personally, if I had to choose only one, I'd rather live with the speed penalty from not byte-compiling. But of course, if both are available, I could install both. I completely second this request. All speed improvements and the byte compiler in particular are leaps forward and I am very grateful and admiring towards the people that make this happen. That being said, 'moving away' from the sources (with the lazy loading files and byte-compilation) may be a step back for R package developers that (during development and maybe on separate development installations [as opposed to production installations of R]) require the sources of all packages to be efficient in their work. As many of you know there is an open source Eclipse/StatET visual debugger ready and for that application as well (similar to Mark's request) presence of non-compiled code is highly desirable. For the particular purpose of debugging R packages, I would even plead to go beyond the current options and support the addition of an R package install option that allows to include the sources (e.g. in a standard folder Rsrc/) in installed packages. 
I am fully aware that one can always fetch the source tarballs from CRAN for that purpose, but it would be much easier if a simple installation option could put the R sources of a package in a separate folder [or archive inside an existing folder] such that R development tools (such as the Eclipse/StatET IDE) can offer inspection of sources or display them (e.g. during debugging) out of the box. If one has the srcref, one can always load the absolutely correct source code this way, even if one doesn't know the parent function with the source attribute. Any comments? I think these requests have already been met. If you modify the body of a closure (as trace() does), then the byte compiled version is discarded, and you go back to the regular interpreted code. If you install packages with the R_KEEP_PKG_SOURCE=yes environment variable set, then you keep all source for all functions. (It's attached to the function itself, not as a file that may be out of date.) It's possible that byte compiling turns off R_KEEP_PKG_SOURCE, but that is something that is either easily fixed, or avoided by re-installing without byte compiling. I don’t know how the new installation works exactly, but would it be possible to simply install both types, the old expression bodies and the new byte-compiled code, as a single package at the same time? This would allow the R user and developer to simply use the variant which is best at the moment. If he wants to debug code, he can switch off the use of byte-compiled code and use the old R expressions (with attached srcrefs). If debugging is not required, he can profit from the byte-compiled version. The best would be a toggle to switch it at runtime, but a startup option would be sufficient too. I think direct access to the code is one big advantage of open source software. For developers it makes it easier to find and fix bugs if something is wrong. 
But it can also help users a lot to understand how a function or algorithm works and to learn from code written by others – if access to the sources is easy. As long as byte-code doesn’t support the debugging features of R, the best debugging support requires running functions entirely without byte-compiled code. If I understood it correctly, byte-code frames would disable srcrefs as well as features like “step return” into those frames. Therefore I ask for an easy way to switch between both execution types. Best, Stephan Duncan Murdoch Best, Tobias P.S. One could even consider a post-install option e.g. to add 'real' R sources (and source references) to Windows packages (which are by definition already 'installed' and for which such information is not by default included in the CRAN binaries of these packages). Prof Brian Ripley wrote: There was an R-core meeting the week before last, and various planned changes will appear in R-devel over the next few weeks. These are changes planned for R 2.14.0 scheduled for Oct 31. As we are sick of people referring to R-devel as '2.14' or
Re: [Rd] Recent and upcoming changes to R-devel
On 07/05/2011 04:21 PM, Duncan Murdoch wrote: On 05/07/2011 10:17 AM, Tobias Verbeke wrote: Dear Duncan, [...] I think these requests have already been met. If you modify the body of a closure (as trace() does), then the byte compiled version is discarded, and you go back to the regular interpreted code. If you install packages with the R_KEEP_PKG_SOURCE=yes environment variable set, then you keep all source for all functions. (It's attached to the function itself, not as a file that may be out of date.) It's possible Can you expand on when files put inside a package at install time will be out of date compared to the source information attached to a function? I (naively) thought the source information was created and attached at install time as well and that it did not change afterwards either. I guess the argument for files is that they have precise locations and allow for easy indexing by development tools external to R (but I may be corrected here as well). that byte compiling turns off R_KEEP_PKG_SOURCE, but that is something that is either easily fixed, or avoided by re-installing without byte compiling. Many thanks for your reaction. Is the R_KEEP_PKG_SOURCE=yes environment variable also supported during R installation? Yes, other than the error you saw below, which is a temporary problem. Not sure which function exceeded the length limit, but the length limit is going away before 2.14.0 is released. Thanks again, Duncan, for the clarification. 
Is it useful (or just whimsical) to have an R function that would allow for a given stock CRAN Windows R installation with stock Windows CRAN binary add-on packages to add the source information that would be useful e.g. for a debugger post factum? I can imagine something like update.packages(., checkSourcesKept = TRUE) as I don't think this can currently be solved with a combination of INSTALL_opts=--with-keep.source and type=source given that there will not be a check for the presence of source information to determine which packages require being updated (or in this case 'completed' with source information). The alternative scenario would be to expect users that want this functionality to compile R and all add-on packages from source (also on Windows or Mac). Best, Tobias I hope I'm not overlooking anything, but when compiling ftp://ftp.stat.math.ethz.ch/Software/R/R-devel.tar.gz a few minutes ago I encountered the following issue: [...]
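The srcref mechanism discussed in this thread can be seen without installing anything. The sketch below is only an illustration (the function body and comment are made up): when keep.source is on, a parsed closure carries a "srcref" attribute holding its exact source text, which is what debuggers and IDEs display.

```r
# Sketch: with keep.source enabled, a closure keeps a "srcref"
# attribute holding its original source text, comments included.
options(keep.source = TRUE)
src <- "function(x) {\n  # this comment survives in the srcref\n  x + 1\n}"
f <- eval(parse(text = src, keep.source = TRUE))
has_srcref <- !is.null(attr(f, "srcref"))
```

For packages, the analogous switch is the R_KEEP_PKG_SOURCE=yes environment variable (or keep.source.pkgs = TRUE) at install time, as Duncan describes above.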
Re: [Rd] [datatable-help] speeding up perception
Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent `[<-` from copying x? Small reproducible example in vanilla R 2.13.0 :

> x = list(a=1:1,b=1:1)
> class(x) = "newclass"
> "[<-.newclass" = function(x,i,j,value) x   # i.e. do nothing
> tracemem(x)
[1] "<0xa1ec758>"
> x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:   # but, x is still copied, why?

I've tried returning NULL from "[<-.newclass" but then x gets assigned NULL :

> "[<-.newclass" = function(x,i,j,value) NULL
> x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]:
> x
NULL

Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far). Matthew On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote: Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for : i) convenience of new users who don't know how to vectorize yet ii) more complex examples which can't be vectorized.

Before: system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After : system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows :

system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ... 
Matthew On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote: Timothée, On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote: Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well). I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative. Cheers, Simon On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote: thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality. 
more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure. If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an
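The as.list()/as.data.frame() round trip ivo mentions can be sketched as follows; the column names and size here are arbitrary illustrations. Element-wise assignment into a plain list of vectors sidesteps data.frame's expensive `[<-` machinery, which is the source of the slowdown discussed in this thread.

```r
# Sketch: do row-wise assignments on a plain list of vectors,
# then convert back to a data.frame at the end.
n <- 100
df <- data.frame(a = numeric(n), b = numeric(n))

l <- as.list(df)                      # plain list: cheap element assignment
for (r in seq_len(n)) l$a[r] <- r * 2
df2 <- as.data.frame(l)               # restore the data.frame when done
```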
Re: [Rd] Recent and upcoming changes to R-devel
On Jul 5, 2011, at 1:45 PM, Tobias Verbeke wrote: On 07/05/2011 04:21 PM, Duncan Murdoch wrote: [...] Can you expand on when files put inside a package at install time will be out of date compared to the source information attached to a function? When you edit such files. I (naively) thought the source information was created and attached at install time as well and that it did not change afterwards either. ... unless you edit it. I guess the argument for files is that they have precise locations and allow for easy indexing by development tools external to R (but I may be corrected here as well). Yes, but the moment you change a file it is no longer reflected in R unless you re-source it. This is usually not an issue if you have a separate installed copy, but if you edit the installed sources directly (something less frequent with lazy-loaded packages but more common in the old days), the files won't reflect what's actually parsed. This is a common problem, not specific to R, really. 
By keeping the sources with the objects, you guarantee that they match even if the source files have been edited - useful for debugging. It's not as esoteric as it sounds - just store a function in a workspace and then continue working on a project ... [...]
Re: [Rd] Recent and upcoming changes to R-devel
On 05/07/2011 1:45 PM, Tobias Verbeke wrote: On 07/05/2011 04:21 PM, Duncan Murdoch wrote: [...] Can you expand on when files put inside a package at install time will be out of date compared to the source information attached to a function? Suppose you're debugging. You change a function, source it: now it's not the same as the one in the package source, it's the one in your editor. I (naively) thought the source information was created and attached at install time as well and that it did not change afterwards either. It won't change if the function doesn't change, but during debugging (or in some strange examples, during normal execution) the function might change. I guess the argument for files is that they have precise locations and allow for easy indexing by development tools external to R (but I may be corrected here as well). 
As in pre-2.13.0, it will keep the locations and time stamps of the files, but we were finding it was too unreliable not to have an actual copy of the contents, so 2.13.0 also keeps a copy of the file, and that's the main source of content to display. [...]
Re: [Rd] [datatable-help] speeding up perception
On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote: Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent `[<-` from copying x? Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved. Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature, which means it can afford not to duplicate if there is only one reference -- then it's safe not to duplicate, as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on. Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain, so there will always be two references and duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around it (unless we handle subassignment methods in some special way). Cheers, Simon

Small reproducible example in vanilla R 2.13.0 :

> x = list(a=1:1,b=1:1)
> class(x) = "newclass"
> "[<-.newclass" = function(x,i,j,value) x   # i.e. do nothing
> tracemem(x)
[1] "<0xa1ec758>"
> x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:   # but, x is still copied, why?

I've tried returning NULL from "[<-.newclass" but then x gets assigned NULL :

> "[<-.newclass" = function(x,i,j,value) NULL
> x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]:
> x
NULL

Any pointers much appreciated. 
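Simon's desugaring of x[i, j] <- value can be written out literally. This sketch (on a plain matrix, not a data.table) checks that the explicit `*tmp*` form from R-lang 3.4.4 performs the same replacement as the ordinary subassignment syntax:

```r
# The subassignment x[1, 2] <- 42 ...
x <- matrix(0, 2, 2)
x[1, 2] <- 42

# ... is equivalent to the `*tmp*` form described in R-lang 3.4.4:
y <- matrix(0, 2, 2)
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 2, 42)
rm(`*tmp*`)

same <- identical(x, y)   # both forms perform the same replacement
```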
Re: [Rd] Syntactically valid names
I wouldn't expect so. The basic structure might be handled using a regexp of sorts, but even that is tricky because of the dot not followed by number rule, and then there's the stop list of reserved words, which would make your code clumsy whatever you do. How on Earth would you expect anything to be significantly more elegant than your function(x) x == make.names(x) anyway??! (OK, if there was a wrapper for the C level isValidName() function...) Good point. Thanks! Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
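Peter's one-liner can be wrapped up as follows; the test names are just illustrations. Note that make.names() already handles both tricky cases he mentions, the dot-followed-by-a-digit rule and the reserved-word stop list, because it mangles such names:

```r
# A name is syntactically valid iff make.names() leaves it unchanged.
is_valid_name <- function(x) x == make.names(x)

v1 <- is_valid_name("x")      # TRUE
v2 <- is_valid_name("x y")    # FALSE: space is mangled to "x.y"
v3 <- is_valid_name(".2way")  # FALSE: dot followed by a digit
v4 <- is_valid_name("if")     # FALSE: reserved word becomes "if."
```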
Re: [Rd] [datatable-help] speeding up perception
On Tue, 5 Jul 2011, Matthew Dowle wrote: Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent `[<-` from copying x? Small reproducible example in vanilla R 2.13.0 :

> x = list(a=1:1,b=1:1)
> class(x) = "newclass"
> "[<-.newclass" = function(x,i,j,value) x   # i.e. do nothing
> tracemem(x)
[1] "<0xa1ec758>"
> x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:   # but, x is still copied, why?

This one is a red herring -- the class(x) <- "newclass" assignment is bumping up the NAMED value, and as a result the following assignment needs to duplicate. (The primitive class<- could be modified to avoid the NAMED bump, but it's fairly intricate code so I'm not going to look into it now.) [A bit more later in reply to Simon's message] luke [...]
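luke's "red herring" can be seen directly in a short session. This is only a sketch (the class name is arbitrary, and the tracemem() call is guarded because it is only available in builds with memory profiling enabled):

```r
# After class(x) <- ..., the reference count (NAMED) on x is bumped,
# so the next subassignment duplicates x even though only one
# variable refers to it.
x <- list(a = 1L, b = 2L)
class(x) <- "someclass"                 # bumps NAMED on x
if (isTRUE(unname(capabilities("profmem"))))
  invisible(tracemem(x))                # prints a message on each copy
x$a <- 99L                              # this assignment duplicates x
```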
Re: [Rd] [datatable-help] speeding up perception
On Tue, 5 Jul 2011, Simon Urbanek wrote:

On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent `[<-` from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved. Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature, which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on. Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain, so there will always be two references and duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around it (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this, though I don't guarantee I did that right). The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise), and so any R level assignments will duplicate.
This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment. It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly. There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best, luke

Cheers, Simon

Small reproducible example in vanilla R 2.13.0:

x = list(a=1:1, b=1:1)
class(x) = "newclass"
"[<-.newclass" = function(x,i,j,value) x   # i.e. do nothing
tracemem(x)
[1] "<0xa1ec758>"
x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:   # but, x is still copied, why?

I've tried returning NULL from "[<-.newclass" but then x gets assigned NULL:

"[<-.newclass" = function(x,i,j,value) NULL
x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]:
x
NULL

Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far).

Matthew

On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:

Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for: i) convenience of new users who don't know how to vectorize yet, ii) more complex examples which can't be vectorized.
Before:

system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After:

system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows:

system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ... Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée, On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote: [...]
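For reference, the R-lang 3.4.4 equivalence Simon cites can be checked directly at the prompt (a small generic sketch, not data.table-specific; the vectors here are invented for illustration):

```r
# x[i] <- v is defined to be equivalent to the `*tmp*` expansion below,
# which is why a subassignment always conceptually involves a copy.
x <- c(10, 20, 30)
y <- x

x[2] <- 99                  # ordinary subassignment

`*tmp*` <- y                # the documented expansion from R-lang 3.4.4
y <- `[<-`(`*tmp*`, 2, 99)
rm(`*tmp*`)

stopifnot(identical(x, y))  # both forms produce the same result
```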
Re: [Rd] Syntactically valid names
On Tue, Jul 5, 2011 at 7:31 PM, steven mosher mosherste...@gmail.com wrote:

regexp approach is kinda ugly http://www.r-bloggers.com/testing-for-valid-variable-names/

Hmm, I think that suggests a couple of small bugs in make.names:

make.names("...")
[1] "..."
make.names("..1")
[1] "..1"

and

x <- paste(rep("x", 1e6), collapse = "")
x == make.names(x)
[1] TRUE

Hadley

-- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Syntactically valid names
regexp approach is kinda ugly http://www.r-bloggers.com/testing-for-valid-variable-names/

On Tue, Jul 5, 2011 at 3:29 PM, Hadley Wickham had...@rice.edu wrote:

I wouldn't expect so. The basic structure might be handled using a regexp of sorts, but even that is tricky because of the "dot not followed by a number" rule, and then there's the stop list of reserved words, which would make your code clumsy whatever you do. How on Earth would you expect anything to be significantly more elegant than your function(x) x == make.names(x) anyway??! (OK, if there was a wrapper for the C level isValidName() function...)

Good point. Thanks! Hadley

-- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
Re: [Rd] Syntactically valid names
On June 30, 2011 01:37:57 PM Hadley Wickham wrote:

Is there any easy way to tell if a string is a syntactically valid name? [...] One implementation would be:

is.syntactic <- function(x) x == make.names(x)

but I wonder if there's a more elegant way.

This is without quoting, right? Because make.names replaces spaces with periods, and using quoting I can create syntactically valid names that do include spaces:

`x prime` <- 3
ls()

Davor
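Hadley's make.names() comparison behaves as follows on the edge cases discussed in this thread (the helper name is invented here; the dotted version `is.syntactic` from the original post works identically):

```r
# A string is a syntactically valid name iff make.names() leaves it
# unchanged, since make.names() mangles anything invalid.
is_syntactic <- function(x) x == make.names(x)

stopifnot(
  is_syntactic("foo"),
  is_syntactic("x.1"),
  !is_syntactic("x prime"),  # space: make.names() gives "x.prime"
  !is_syntactic("break"),    # reserved word: becomes "break."
  !is_syntactic(".1x")       # dot followed by a digit: becomes "X.1x"
)
```

Note this covers both tricky cases Peter mentions (the dot-digit rule and the reserved-word list) without a hand-written regexp.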
Re: [Rd] Syntactically valid names
This is without quoting, right? Because make.names replaces spaces with periods, and using quoting I can create syntactically valid names that do include spaces:

`x prime` <- 3
ls()

That's not a syntactically valid name - you use backticks to refer to names that are not syntactically valid. Hadley

-- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
Re: [Rd] Syntactically valid names
On Jul 6, 2011, at 01:40 , Hadley Wickham wrote:

On Tue, Jul 5, 2011 at 7:31 PM, steven mosher mosherste...@gmail.com wrote: regexp approach is kinda ugly http://www.r-bloggers.com/testing-for-valid-variable-names/

Hmm, I think that suggests a couple of small bugs in make.names:

make.names("...")
[1] "..."
make.names("..1")
[1] "..1"

What's wrong with that? They are names alright, just with special meanings.

x <- quote(...)
mode(x)
[1] "name"

and

x <- paste(rep("x", 1e6), collapse = "")
x == make.names(x)
[1] TRUE

Mildly insane, but technically OK, no?

Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/

-- Peter Dalgaard Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
Re: [Rd] Syntactically valid names
On July 5, 2011 04:59:16 PM Hadley Wickham wrote: That's not a syntactically valid name - you use backticks to refer to names that are not syntactically valid. I was too loose in my terminology: I meant that `x prime` is a valid name, but as you said, it is not syntactically valid. Davor __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu luke-tier...@uiowa.edu wrote:

On Tue, 5 Jul 2011, Simon Urbanek wrote: [...]

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this, though I don't guarantee I did that right).
The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise), and so any R level assignments will duplicate. This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment. It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

Would `$<-` always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.

-- There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best, luke

Cheers, Simon

-- Luke Tierney, Statistics and Actuarial Science, Ralph E. Wareham Professor of Mathematical Sciences, University of Iowa. Phone: 319-335-3386, Fax: 319-335-3017. Department of Statistics and Actuarial Science, 241 Schaeffer Hall, Iowa City, IA 52242. email: l...@stat.uiowa.edu WWW: http://www.stat.uiowa.edu

David Winsemius, MD West Hartford, CT
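Luke's observation that a package-level replacement function is just an ordinary closure, and can therefore also be called directly, can be illustrated as follows (a sketch; the generic name `myfield<-` is invented here for illustration, not part of any package):

```r
# A replacement function defined in R. Inside it, the R-level
# assignment duplicates x, because the forced argument promise
# leaves NAMED at 2 -- exactly the situation Luke describes.
`myfield<-` <- function(x, value) {
  x$field <- value
  x
}

obj <- list(field = 0)

out <- `myfield<-`(obj, 42)   # direct call: obj itself is untouched
stopifnot(obj$field == 0, out$field == 42)

myfield(obj) <- 1             # replacement context: obj is rebound
stopifnot(obj$field == 1)
```

The two call forms behave differently for the caller, which is why any protocol letting such functions skip duplication would need a reliable way to tell the replacement context apart from a direct call.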
Re: [Rd] [datatable-help] speeding up perception
No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, just to clarify - the data frame content is not copied; we are talking about the vector holding columns.

Cheers, Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote: [...]
Re: [Rd] Syntactically valid names
What's wrong with that? They are names alright, just with special meanings.

But you can't really use them for variables:

... <- 4
...
Error: '...' used in an incorrect context
..1 <- 4
..1
Error: 'nthcdr' needs a list to CDR down

And make.names generally protects you against that:

make.names("function")
[1] "function."
make.names("break")
[1] "break."
make.names("TRUE")
[1] "TRUE."

x <- paste(rep("x", 1e6), collapse = "")
x == make.names(x)
[1] TRUE

Mildly insane, but technically OK, no?

I don't think so:

x <- paste(rep("x", 1e6), collapse = "")
assign(x, 1)
Error in assign(x, 1) : variable names are limited to 10000 bytes

Hadley

-- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/