[Rd] SET_NAMED in getattrib0
Can someone please set me straight on why getattrib0 calls SET_NAMED on the SEXP it returns? For example, the line SET_NAMED(CAR(s), 2); appears near the end of getattrib0 here: https://svn.r-project.org/R/trunk/src/main/attrib.c

getattrib() is just reading the value. Shouldn't NAMED be bumped if and when the result of getattrib() is bound to a symbol at R level?

Thanks, Matthew
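For anyone wanting to watch this from the R level: the NAMED value is visible via .Internal(inspect()). A minimal sketch -- assuming attr() reaches getattrib0 here, and noting that recent R versions print a REF() reference count rather than NAM():

    x <- 1:5
    attr(x, "myattr") <- "a"     # set an attribute
    a <- attr(x, "myattr")       # read it back
    .Internal(inspect(a))        # header shows NAM(2), per the SET_NAMED above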
Re: [Rd] declaring package dependencies
On Sep 16, 2013, at 01:46 PM, Brian Rowe wrote:

That reminds me: I once made a suggestion on how to automate some of the CRAN deployment process, but it was shot down as not being useful to them. I do recall a quote along the lines of "as long as you don't need help, do whatever you want", so one thought is to just set up a build server that does the building across the three versions of R, checks dependencies, rebuilds when release, patch, or devel are updated, etc. This would ease the burden on package maintainers and would just happen to make the CRAN folks' lives easier by catching a lot of bad builds. A proof of concept on AWS connecting to GitHub or R-Forge could probably be finished on a six-pack. Speak up if anyone thinks this would be useful.

Yes, useful. But that includes a package build system (which is what breaks on R-Forge). If you could do that on a six-pack then could you fix R-Forge on a three-pack first, please? The R-Forge build system is itself an open source package on R-Forge. Anyone can look at it, understand it and change it to be more stable. That build system is here: https://r-forge.r-project.org/R/?group_id=34 (I only know this because Stefan told me once. So I suspect others don't know either, or it hasn't sunk in that we're pushing on an open door.)

Matthew
Re: [Rd] declaring package dependencies
Ben Bolker wrote: Do you happen to remember what the technical difficulty was?

From memory, I think it was that CRAN maintainers didn't have access to Uwe's winbuilder machine. But often when I get OK from winbuilder R-devel I don't want it to go to CRAN yet. So procedures and software would have to be put in place to handle that (unclear) logic, which I didn't propose anything for or offer any code to do. So: time and effort to decide, and time and effort to implement. Just a guess. And maybe some packages don't run on Windows, so what about those? It's all those edge cases that really take the time.

Matthew
Re: [Rd] helping R-forge build
On 16/09/13 16:11, Paul Gilbert wrote: (subject changed from Re: [Rd] declaring package dependencies)

... Yes useful. But that includes a package build system (which is what breaks on R-Forge). If you could do that on a six-pack then could you fix R-Forge on a three-pack first, please? The R-Forge build system is itself an open source package on R-Forge. Anyone can look at it, understand it and change it to be more stable. That build system is here: https://r-forge.r-project.org/R/?group_id=34 (I only know this because Stefan told me once. So I suspect others don't know either, or it hasn't sunk in that we're pushing on an open door.) Matthew

Open code is necessary, but to debug one needs access to logs, etc., to see where it is breaking. Do you know how to find that information?

There's a link at the bottom of the R-Forge page to http://download.r-forge.r-project.org/STATUS -- I don't know if that's enough but it's a start, maybe. I've copied Stefan in case there are more logs somewhere else.

(And, BTW, there are also tools to help automatically build R and test packages at http://automater.r-forge.r-project.org/ .)

automater looks good! What's the next step?

Paul
Re: [Rd] declaring package dependencies
I'm a little surprised by this thread. I subscribe to the RSS feeds of changes to NEWS (as Dirk mentioned) and that's been pretty informative in the past: http://developer.r-project.org/RSSfeeds.html

Mainly, though, I submit to winbuilder before submitting to CRAN, as the CRAN policies advise. winbuilder's R-devel seems to be built daily, saving me the time. Since I don't have Windows it kills two birds with one stone. It has caught many problems for me before submitting to CRAN and I can't remember it ever not responding in a reasonable time. http://win-builder.r-project.org/upload.aspx

I've suggested before that winbuilder could be the mechanism to submit to CRAN rather than an ftp upload to incoming: only if winbuilder passed OK on R-devel could it then go to a human. But IIRC there was a technical difficulty preventing this.

Matthew
[Rd] C API entry point to currentTime()
Hi,

I used to use currentTime() (from /src/main/datetime.c) to time various sections of data.table C code in wall clock time with sub-second accuracy (type double), consistently across platforms. The consistency across platforms is a really nice feature of currentTime(). But currentTime() isn't part of R's API so I changed to clock() in order to pass R3 checks. This is nicer in many ways but I'd still like to time elapsed wall clock time as well, since some of the operations are i/o bound.

Does R provide a C entry point to currentTime() (or equivalent) suitable for use by packages? I searched the r-devel archive and the manuals but may well have missed it.

Thanks, Matthew
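For comparison, at the R level (not C) the wall-clock versus CPU distinction the post is after is the one between the components of proc.time(); a small illustration:

    p <- proc.time()
    Sys.sleep(1)         # an i/o-like wait: elapses without consuming CPU
    proc.time() - p      # user/system stay ~0; 'elapsed' (wall clock) is ~1 sec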
Re: [Rd] double in summary.c : isum
On 25.03.2013 09:20, Prof Brian Ripley wrote: On 24/03/2013 15:01, Duncan Murdoch wrote: On 13-03-23 10:20 AM, Matthew Dowle wrote: On 23.03.2013 12:01, Prof Brian Ripley wrote: On 20/03/2013 12:56, Matthew Dowle wrote:

Hi, Please consider the following:

    x = as.integer(2^30-1)
    x
    [1] 1073741823
    sum(c(rep(x, 1000), rep(-x, 999)))
    [1] 1073741824

Tested on 2.15.2 and a recent R-devel (r62132). I'm wondering if s in isum could be LDOUBLE instead of double, like rsum, to fix this edge case?

No, because there is no guarantee that LDOUBLE differs from double (and there are platforms on which it does not).

That's a reason for not using LDOUBLE at all, isn't it? Yet src/main/*.c has 19 lines using LDOUBLE, e.g. arithmetic.c and cum.c as well as summary.c. I'd assumed LDOUBLE was being used by R to benefit from long double (or equivalent) on platforms that support it (which is all modern Unix, Mac and Windows as far as I know). I do realise that the edge case wouldn't ...

Actually, you don't know. Really only on almost all Intel ix86: most other current CPUs do not have it in hardware. C99/C11 require long double, but do not require the accuracy that you are thinking of, and it can be implemented in software.

This is very interesting, thanks. Which of the CRAN machines don't support LDOUBLE with higher accuracy than double, either in hardware or software? Yes, I had assumed that all CRAN machines would do. It would be useful to know for something else I'm working on as well.

... be fixed on platforms where LDOUBLE is defined as double.

I think the problem is that there are two opposing targets in R: we want things to be as accurate as possible, and we want them to be consistent across platforms. Sometimes one goal wins, sometimes the other. Inconsistencies across platforms give false positives in tests that tend to make us miss true bugs. Some people think we should never use LDOUBLE because of that. In other cases, the extra accuracy is so helpful that it's worth it. So I think you'd need to argue that the case you found is something where the benefit outweighs the costs. Since almost all integer sums are done exactly with the current code, is it really worth introducing inconsistencies in the rare inexact cases?

But as I said lower down, a 64-bit integer accumulator would be helpful; C99/C11 requires one at least that large and it is implemented in hardware on all known R platforms. So there is a way to do this pretty consistently across platforms.

That sounds much better. Is it just a matter of changing s to be declared as uint64_t?

Duncan Murdoch

What have I misunderstood?

Users really need to take responsibility for the numerical stability of calculations they attempt. Expecting to sum 20 million large numbers exactly is unrealistic.

Trying to take responsibility, but you said no. Changing from double to LDOUBLE would mean that something that wasn't realistic, was then realistic (on platforms that support long double). And it would bring open source R into line with TERR, which gets the answer right, on 64-bit Windows at least. But I'm not sure I should be as confident in TERR as I am in open source R because I can't see its source code.

There are cases where 64-bit integer accumulators would be beneficial, and this is one. Unfortunately C11 does not require them but some optional moves in that direction are planned.
https://svn.r-project.org/R/trunk/src/main/summary.c

Thanks, Matthew
Re: [Rd] double in summary.c : isum
On 25.03.2013 11:27, Matthew Dowle wrote: [...] That sounds much better. Is it just a matter of changing s to be declared as uint64_t?

Typo. I meant int64_t.
Re: [Rd] double in summary.c : isum
On 25.03.2013 11:31, Matthew Dowle wrote: [...] Typo. I meant int64_t.

But even a 64-bit integer might under- or overflow. Which is one of the reasons for accumulating in double (or LDOUBLE), isn't it? To save a test for over/underflow on each iteration.
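For background on the accumulator question, the quantitative limit of a double accumulator is its 53-bit mantissa, which can be seen directly in R:

    2^53 == 2^53 + 1       # TRUE : the +1 is lost beyond 53 bits
    2^53 - 2 == 2^53 - 1   # FALSE: integers below 2^53 are still exact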
Re: [Rd] conflict between rJava and data.table
Simon Urbanek wrote: Can you elaborate on the details as of where this will be a problem? Packages should not be affected since they should be importing the namespaces from the packages they use, so the only problem would be in a package that uses both data.table and rJava -- and this is easily resolved in the namespace of such package. So there is no technical reason why you can't have multiple definitions of J - that's what namespaces are for.

Right. It's users using J() in their own code, IIUC. rJava's manual says J is "the high-level access to Java". When they use J() on its own they probably want the rJava one, but if data.table is higher they get that one. They don't want to have to write out rJava::J(...). It is not just rJava but package XLConnect, too. If there's a better way I would be interested, but I didn't mind removing J from data.table.

Bunny/Matt,

To add to Steve's reply, here's some background. This is well documented in NEWS, and Googling "data.table J rJava" and similar returns useful links to NEWS and datatable-help (so you shouldn't have needed to post to r-devel).

From 1.8.2 (Jul 2012):

o The J() alias is now deprecated outside DT[...], but will still work inside DT[...], as in DT[J(...)]. J() is conflicting with function J() in package XLConnect (#1747) and rJava (#2045). For data.table to change is easier, with some efficiency advantages too. The next version of data.table will issue a warning from J() when used outside DT[...]. The version after will remove it. Only then will the conflict with rJava and XLConnect be resolved. Please use data.table() directly instead of J(), outside DT[...].

From 1.8.4 (Nov 2012):

o J() now issues a warning (when used *outside* DT[...]) that using it outside DT[...] is deprecated. See item below in v1.8.2. Use data.table() directly instead of J(), outside DT[...]. Or, define an alias yourself. J() will continue to work *inside* DT[...] as documented.

From 1.8.7 (soon to be on CRAN):

o The J() alias is now removed *outside* DT[...], but will still work inside DT[...]; i.e., DT[J(...)] is fine. As warned in v1.8.2 (see below in this file) and deprecated with warning() in v1.8.6. This resolves the conflict with function J() in package XLConnect (#1747) and rJava (#2045). Please use data.table() directly instead of J(), outside DT[...].

Matthew
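The NEWS item's suggestion to "define an alias yourself" can be as small as this, in a user's own script -- assuming rJava's J() is the one wanted:

    library(rJava)
    library(data.table)
    J <- rJava::J    # user-level alias; wins over both packages
    # J() now reaches Java regardless of the load order above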
Re: [Rd] conflict between rJava and data.table
On 01.03.2013 16:13, Simon Urbanek wrote: On Mar 1, 2013, at 8:03 AM, Matthew Dowle wrote: [...]

For packages there is really no issue - if something breaks in XLConnect then the authors are probably importing the wrong function in their namespace (I still didn't see a reproducible example, though). The only difference is for interactive use, so not having a conflicting J() [if possible] would actually be useful there, since J() in rJava is primarily intended for interactive use.

Yes, that's what I wrote above, isn't it? i.e. it's users using J() in their own code, IIUC. J is "the high-level access to Java". Not just interactive use (i.e. at the R prompt) but inside their functions and scripts, too. Although, I don't know the rJava package at all, so why J() might be used for interactive use but not in functions and scripts isn't clear to me.

Any use of J from example(J) will serve as a reproducible example; e.g.,

    library(rJava)        # load rJava first
    library(data.table)   # then data.table
    J("java.lang.Double")

There is no error or warning, but the user is returned a 1-row, 1-column data.table rather than something related to Java. The errors/warnings follow from there. The user can either load the packages the other way around, or use :: :

    library(rJava)        # load rJava first
    library(data.table)   # then data.table
    rJava::J("java.lang.Double")   # ok now

Cheers, Simon
Re: [Rd] conflict between rJava and data.table
On 01.03.2013 20:19, Simon Urbanek wrote: On Mar 1, 2013, at 11:40 AM, Matthew Dowle wrote: [...]

Matt, there are two entirely separate uses: a) interactive use, b) use in packages. You are describing a), and as I said in the latter part above, J() in rJava is meant for that, so it would be useful to not have a conflict there.

Yes, (a) is the problem. Good, so I did the right thing in July 2012 by starting to deprecate J in data.table when this problem was first reported.

However, in the first part of my e-mail I was referring to b), where there is no conflict, because packages define which package a symbol will come from, so the user search path plays no role. Today, all packages should be using imports so search path pollution should no longer be an issue, so the order in which the user attached packages to their search path won't affect the functionality of the packages (that's why namespaces are mandatory). Therefore, if XLConnect breaks (again, I don't know, I didn't see it) due to the order on the search path, it indicates there is a bug in its namespace as it's apparently importing the wrong J - it should be importing it from rJava and not data.table. Is that more clear?

Yes, thanks. (b) isn't a problem. rJava and XLConnect aren't breaking; the users aren't reporting that. It's merely problem (a), e.g. where end users of both rJava and data.table use J() in their own code.

Cheers, Simon
Re: [Rd] Implications of a Dependency on a GPLed Package
Christian,

In my mind, rightly or wrongly, it boils down to these four points:

1. CRAN policy excludes closed-source packages; i.e., every single package on CRAN includes its C code, if any. If an R package included a .dll or .so which linked at C level to R, and that was being distributed without providing the source, then that would be a clear breach of R's GPL. But nobody is aware of any such package. Anyone who is aware of one should let the R Foundation know. Whether or not the GPL applies to R-only interpreted code (by definition you cannot close-source interpreted code) is important too, but not as important as distribution of closed-source binaries linking to R at C level.

2. Court cases would never happen unless two lawyers disagreed. Even then, two judges can disagree (otherwise appeals would never be successful).

3. There are two presidents of the R Foundation, and it appears they disagree. Therefore it appears very unlikely that the R Foundation would bring a GPL case against anyone. Rather, it seems to be up to the community to decide for themselves. If you don't mind closed-source, non-free software linking to R at C level then buy it (if that exists); if you do mind, don't.

4. As a package author it is entirely up to you how to approach this area. Yes, seek legal advice. And I'd suggest seeking the advice of several lawyers, not just one. Then follow the advice that you like the best.

Matthew
Re: [Rd] Bounty on Error Checking
On Fri, Jan 3, 2013, Bert Gunter wrote:

Well... On Thu, Jan 3, 2013 at 10:00 AM, ivo welch ivo.welch at anderson.ucla.edu wrote: Dear R developers---I just spent half a day debugging an R program, which had two bugs---I selected the wrongly named variable, which turns out to have been a scalar, which then happily multiplied as if it was a matrix; and another wrongly named variable from a data frame, that triggered no error when used as a[[name]] or a$name. There should be an option to turn on that throws an error inside R when one does this. I cannot imagine that there is much code that wants to reference non-existing columns in data frames.

But I can -- and do it all the time: To add a new variable, "d", to a data frame, df, containing only "a" and "b" (with 10 rows, say): df[["d"]] <- 1:10

Yes, but that's `[[<-`. Ivo was talking about `[[` and `$`; i.e., select only, not assign, if I understood correctly.

Trying to outguess documentation to create error triggers is a very bad idea.

Why exactly is it a very bad idea? (I don't necessarily disagree, just asking for more colour.)

R already has plenty of debugging tools -- and there is even a debug package. Perhaps you need a better programming editor/IDE. There are several listed on CRAN, RStudio, etc.

True, but that relies on you knowing there's a bug to hunt for. What if you don't know you're getting incorrect results, silently? In a similar way that options(warn=2) turns known warnings into errors, to enable you to be more strict if you wish, an option to turn on warnings from `[[` and `$` if the column is missing (select only, not assign) doesn't seem like a bad option to have. Maybe it would reveal some previously silent bugs.

Anyway, I'm hoping Ivo will let us know if he likes the simple mask I proposed, or not. That's already an option that can be turned on or off. But if his bug was selecting the wrong column, not a missing one, then I'm not sure anything could (or needs to be) done about that.

Matthew
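For concreteness, the options(warn=2) behaviour referred to -- any warning becomes a halting error:

    options(warn = 2)
    as.integer("abc")   # 'NAs introduced by coercion' now stops with an error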
Re: [Rd] Bounty on Error Checking
On 04.01.2013 14:03, Duncan Murdoch wrote: On 13-01-04 8:32 AM, Matthew Dowle wrote: [...]

In a similar way that options(warn=2) turns known warnings into errors, to enable you to be more strict if you wish, ...

I would say the point of options(warn=2) is rather to let you find the location of the warning more easily, because it will abort the evaluation.

True, but as well as that, I sometimes like to run production systems with options(warn=2). I'd prefer some tasks to halt at the slightest hint of trouble than write a warning silently to a log file that may not be looked at. I think of that as being more strict, more robust, since options(warn=2) is set even when there is no warning, to catch one if it arises in future -- not just to find it more easily once you know there is a warning.

I would not recommend using code that issues warnings.

Not sure what you mean here.

... an option to turn on warnings from `[[` and `$` if the column is missing (select only, not assign) doesn't seem like a bad option to have. Maybe it would reveal some previously silent bugs.

I agree that this would sometimes be useful, but a very common convention is to do something like if (is.null(obj$element)) { do something }. These would all have to be re-written to something like if (missing.field(obj, "element")) { do something }. There are several hundred examples of the first usage in base R; I imagine thousands more in contributed packages.

Yes, but Ivo doesn't seem to be writing that if() in his code. We're only talking about an option that users can turn on for their own code, IIUC -- not anything that would affect or break thousands of packages. That's why I referred to the fact that all packages now have namespaces in the earlier post.

I don't think the benefit of the change is worth all the work that would be necessary to implement it.

It doesn't seem to be a lot of work. I already posted a working straw man, for example, as a first step.

Matthew
Re: [Rd] Bounty on Error Checking
On 04.01.2013 14:56, Duncan Murdoch wrote: On 04/01/2013 9:51 AM, Matthew Dowle wrote: [...]

I just meant that I consider warnings to be a problem (as you do), so they should all be fixed.

I see now, good.

[...]

I understood the proposal to be that evaluating obj$element would issue a warning if element didn't exist. If that were the case, then the common test is.null(obj$element) would issue a warning in the cases where it now returns TRUE.

Yes, but only for obj$element appearing in Ivo's own code -- not if a package does that (including base). That's why I thought masking `[[` and `$` in .GlobalEnv might achieve that without affecting packages or base, although I don't know how such an option could be made available by R. Maybe options(strictselect=TRUE) would create those masks in .GlobalEnv, and options(strictselect=FALSE) would remove them. A package maintainer might choose to set that in their package to make it stricter (which would create those masks in the package's namespace too). Or users could just create those masks themselves, since it's only a few lines, without affecting packages or base.

Matthew
Re: [Rd] Bounty on Error Checking
On 04.01.2013 15:22, Duncan Murdoch wrote: On 04/01/2013 10:15 AM, Matthew Dowle wrote: [...]

options() are global.

I realise that. I was thinking that inside the options() function it could see if strictselect was being changed and then create the masks in .GlobalEnv. But I can see that is ugly.
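The "working straw man" itself isn't reproduced in this archive; a minimal sketch of the idea (the method body below is this sketch's assumption, not the original code) -- a select-only mask in .GlobalEnv that warns on missing columns, leaving assignment and package code alone:

    # Hypothetical mask: warn when $ reads a column that doesn't exist.
    `$.data.frame` <- function(x, name) {
      # name arrives as a character string in S3 methods for $
      if (!name %in% names(x))
        warning("column '", name, "' not found", call. = FALSE)
      .subset2(x, name)   # default extraction, bypassing this mask
    }

    df <- data.frame(a = 1:2, b = 3:4)
    df$d                  # still NULL, but now with a warning
    rm(`$.data.frame`)    # remove the mask to restore the default silence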
Re: [Rd] How to ensure -O3 on Win64
On 28.12.2012 00:41, Simon Urbanek wrote: On Dec 27, 2012, at 6:08 PM, Matthew Dowle wrote: On 27.12.2012 17:53, Simon Urbanek wrote: On Dec 23, 2012, at 9:22 PM, Matthew Dowle wrote:

Hi, Similar questions have come up before on the list and elsewhere but I haven't found a solution yet. winbuilder's install.out shows data.table's .c files compiled with -O3 on Win32 but -O2 on Win64. The same happens on R-Forge. I gather that some packages don't work with -O3 so the default is -O2. I've tried this in data.table's Makevars (entire contents):

    MAKEFLAGS=CFLAGS=-O3                   # added
    CFLAGS=-O3                             # added
    PKG_CFLAGS=-O3                         # added
    all: $(SHLIB)                          # no change
        mv $(SHLIB) datatable$(SHLIB_EXT)  # no change

but -O2 still appears in winbuilder's install.out (after -O3, and I believe the last -O is the one that counts):

    gcc -m64 -ID:/RCompile/recent/R-2.15.2/include -DNDEBUG -Id:/Rcompile/CRANpkg/extralibs215/local215/include -O3 -O2 -Wall -std=gnu99 -mtune=core2 -c dogroups.c -o dogroups.o

How can I ensure that data.table is compiled with -O3 on Win64?

You can't - at least not in a way that doesn't circumvent the R build system. Also it's not portable, so you don't want to mess with optimization flags and hard-code it in your package, as it's the user's choice how they set up R and its flags. You can certainly set up your R to compile with -O3; you just can't impose that on others. Cheers, Simon

Thanks Simon. This makes complete sense where users compile packages on install (Unix and Mac, and I'd better check my settings then), but Windows, where it's more common for the user to install the pre-compiled .zip from CRAN, is my concern. This came up because the new fread function in data.table wasn't showing as much of a speedup on Win64 as on Linux. I'm not 100% sure that non -O3 is the cause, but there are some function calls which get iterated a lot (e.g. isspace) and I'd seen that inlining was something -O3 did and -O2 did not. In general, why wouldn't a user of a package want the best performance from -O3?

Because it doesn't work? I don't know; you said yourself that -O2 may be there since -O3 breaks - that was not the question, though. (If you are curious about that, ask on CRAN; I don't remember the answer -- note that Win64 compiler support is relatively recent.)

Indeed, I had forgotten how recent that was. Ok, this is clicking now.

By non-portable do you mean the executable produced by winbuilder (or by CRAN) might not run on all Windows machines it's installed on (because -O3 (over-)optimizes for the machine it's built on), or do you mean that -O3 itself might not be available on some compilers (and if so, which compilers don't have -O3?).

Non-portable as in -O3 may not be supported or may break (we have seen -O3 trigger bugs in gcc before). If you hard-code it, there is no way around it. The point is that you cannot make decisions for the user in advance, because you don't know the setup the user may use. I agree that Windows is a bit of a special case in that there are very few choices so the risk of breaking things is lower, but if -O2 is really such a big deal, it is not just your problem and so you may want to investigate it further.

Ok, thanks a lot for the info. I'll try a few more things and follow up off r-devel if need be.

Matthew
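The user-level route Simon describes -- setting up your own R to compile with -O3 -- is normally a personal Makevars file rather than anything shipped in a package; a sketch (file location per Writing R Extensions):

    ## ~/.R/Makevars (or ~/.R/Makevars.win on Windows): applies when *you*
    ## install packages from source; it cannot be imposed on other users.
    CFLAGS = -O3 -Wall -std=gnu99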
Re: [Rd] read.csv reads more rows than indicated by wc -l
Ben,

Somewhere on my wish/TO DO list is for someone to rewrite read.table for better robustness *and* efficiency ...

Wish granted. New in data.table 1.8.7:

=====
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all auto detected.
* integers > 2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns:
    read.csv("test.csv")                                      # 30-60 sec
    read.table("test.csv", all known tricks and known nrows)  # 10 sec
    fread("test.csv")                                         # 3 sec
* airline data: 658MB csv (7 million rows x 29 columns):
    read.table("2008.csv", all known tricks and known nrows)  # 360 sec
    fread("2008.csv")                                         # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, discussions and beta testing.
=====

The help page ?fread is fairly well developed: https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable

Comments, feedback and bug reports very welcome.

Matthew
http://datatable.r-forge.r-project.org/
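A quick taste of the "A,B\n1,2\n3,4" form mentioned above -- the string itself is treated as the input, no file needed:

    library(data.table)
    DT <- fread("A,B\n1,2\n3,4")   # header, sep and types auto detected
    DT
    #    A B
    # 1: 1 2
    # 2: 3 4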
Re: [Rd] built-in NAMED(obj) from within R
Benjamin Tyner btyner at gmail.com writes:

Hello, Is it possible to retrieve the 'named' field within the header (sxpinfo) of an object, without resorting to a debugger, external code, etc?

And much more than just NAMED:

    .Internal(inspect(x))

The goal is to ascertain whether a copy of an object has been made.

Then:

    ?tracemem

One demonstration of using both together is here: http://stackoverflow.com/a/10312843/403310

Matthew
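Putting the two together -- an assumed session; the exact header format varies by R version, and tracemem() requires a build with memory profiling (the default for CRAN binaries):

    x <- runif(5)
    .Internal(inspect(x))   # header line includes the NAMED field
    tracemem(x)             # start reporting copies of x
    y <- x                  # binding alone: no copy yet
    y[1] <- 0               # the modification: tracemem reports the copy here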
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
On Sun, Nov 4, 2012 at 6:35 AM, Justin Talbot jtal...@stanford.edu wrote:

Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix. The case for pprod is similar, plus colProds doesn't exist.

Right, and consistency; for what that's worth.

Thus, + should have the signature `+`(..., na.rm=FALSE), which would allow you to do things like: `+`(c(1,2),c(1,2),c(1,2),NA, na.rm=TRUE) = c(3,6). If you don't like typing `+`, you could always alias psum to `+`.

But there would be a cost, wouldn't there? `+` is a dyadic .Primitive. Changing that to take `...` and `na.rm` could slow it down (IIUC), and any changes to the existing language are risky. For example, `+`(1,2,3) is currently an error. ...

There would be a very slight performance cost for the current interpreter. For the new bytecode compiler, though, there would be no performance cost since the common binary form can be detected at compile time and an optimized bytecode can be emitted for it. Taking what's currently an error and making it legal is a pretty safe change; unless someone is currently relying on `+`(1,2,3) to return an error, which I doubt. I think the bigger question on making this change work would be on the S3 dispatch logic. I don't understand the intricacies of S3 well enough to know if this change is plausible or not.

Interesting. Sounds more possible than I thought.

In contrast, adding two functions that didn't exist before, psum and pprod, seems to be a safer and simpler proposition.

Definitely easier. Leaves the language a bit more complicated, but that might be the right trade-off. I would strongly suggest adding pany and pall as well; I find myself wishing for them all the time. prange would be nice as well.

Have a look at the matrixStats package; it might bring what you're looking for: http://cran.r-project.org/web/packages/matrixStats

/Henrik

Nice package and very handy. It has colProds, too. But its functions take a matrix. "Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix."

Matthew
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
Justin Talbot jtalbot at stanford.edu writes:

Because that's inconsistent with pmin and pmax when two NAs are summed.

    x = c(1,3,NA,NA,5)
    y = c(2,NA,4,NA,1)
    colSums(rbind(x, y), na.rm = TRUE)
    [1] 3 3 4 0 6    # actual
    [1] 3 3 4 NA 6   # desired

But your desired result would be inconsistent with sum:

    sum(NA,NA,na.rm=TRUE)
    [1] 0

From a language definition perspective I think having psum return 0 here is the right choice.

Ok, you've sold me. psum(NA,NA,na.rm=TRUE) returning 0 sounds good. And pprod(NA,NA,na.rm=TRUE) returning 1, consistent with prod, then. Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix. The case for pprod is similar, plus colProds doesn't exist.

Thus, + should have the signature `+`(..., na.rm=FALSE), which would allow you to do things like: `+`(c(1,2),c(1,2),c(1,2),NA, na.rm=TRUE) = c(3,6). If you don't like typing `+`, you could always alias psum to `+`.

But there would be a cost, wouldn't there? `+` is a dyadic .Primitive. Changing that to take `...` and `na.rm` could slow it down (IIUC), and any changes to the existing language are risky. For example, `+`(1,2,3) is currently an error. Changing that to do something might have implications for some of the 4,000 packages (some might rely on that being an error), with a possible speed cost too. In contrast, adding two functions that didn't exist before, psum and pprod, seems to be a safer and simpler proposition.

Matthew
[Rd] There is pmin and pmax each taking na.rm, how about psum?
Hi,

Please consider the following :

    > x = c(1,3,NA,5)
    > y = c(2,NA,4,1)
    > min(x, y, na.rm=TRUE)    # ok
    [1] 1
    > max(x, y, na.rm=TRUE)    # ok
    [1] 5
    > sum(x, y, na.rm=TRUE)    # ok
    [1] 16
    > pmin(x, y, na.rm=TRUE)   # ok
    [1] 1 3 4 1
    > pmax(x, y, na.rm=TRUE)   # ok
    [1] 2 3 4 5
    > psum(x, y, na.rm=TRUE)
    [1] 3 3 4 6                # expected result
    Error: could not find function "psum"   # actual result

I realise that + is already like psum, but what about NA?

    > x+y
    [1] 3 NA NA 6   # can't supply `na.rm=TRUE` to `+`

Is there a case to add psum? Or have I missed something.

This question survived when I asked on Stack Overflow : http://stackoverflow.com/questions/13123638/there-is-pmin-and-pmax-each-taking-na-rm-why-no-psum

And a search of the archives found that Gabor has suggested it too, as an aside : http://r.789695.n4.nabble.com/How-to-do-it-without-for-loops-tp794745p794750.html

If someone from R core is willing to sponsor the idea, I am willing to write, test and submit the code for psum, implemented in a very similar fashion to pmin and pmax. Or perhaps it exists already in a package somewhere (I searched but didn't find it).

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
Because that's inconsistent with pmin and pmax when two NAs are summed:

    > x = c(1,3,NA,NA,5)
    > y = c(2,NA,4,NA,1)
    > colSums(rbind(x, y), na.rm = TRUE)
    [1] 3 3 4 0 6    # actual
    [1] 3 3 4 NA 6   # desired

and it would be less convenient/natural (and slower) than a psum which would call .Internal(psum(na.rm, ...)) in the same way as pmin and pmax.

Why don't you make a matrix and use colSums or rowSums?

    x = c(1,3,NA,5)
    y = c(2,NA,4,1)
    colSums(rbind(x, y), na.rm = TRUE)

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
thierry.onkel...@inbo.be
www.inbo.be

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Possible page inefficiency in do_matrix in array.c
Actually, my apologies, I was assuming that your example was based on the SO question while it is not at all (the code is not involved in that test case). Reversing the order does indeed cause a delay. Switching to a single index doesn't seem to have any impact. R-devel has the faster version now (which now also works with large vectors).

Cheers, Simon

I was intrigued why the compiler doesn't swap the loops when you thought it should, though. You're not usually wrong! From GCC's documentation (the end of the last paragraph is the most significant) :

-floop-interchange
    Perform loop interchange transformations on loops. Interchanging two
    nested loops switches the inner and outer loops. For example, given a
    loop like:

        DO J = 1, M
          DO I = 1, N
            A(J, I) = A(J, I) * C
          ENDDO
        ENDDO

    loop interchange transforms the loop as if it were written:

        DO I = 1, N
          DO J = 1, M
            A(J, I) = A(J, I) * C
          ENDDO
        ENDDO

    which can be beneficial when N is larger than the caches, because in
    Fortran, the elements of an array are stored in memory contiguously by
    column, and the original loop iterates over rows, potentially creating
    at each access a cache miss. This optimization applies to all the
    languages supported by GCC and is not limited to Fortran. To use this
    code transformation, GCC has to be configured with --with-ppl and
    --with-cloog to enable the Graphite loop transformation infrastructure.

Could R build scripts be configured to set these gcc flags to turn on Graphite, then? I guess one downside could be the time to compile.

Matthew

On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:

On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:

In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
        for (i = 0; i < nr; i++)
            for (j = 0; j < nc; j++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
        for (j = 0; j < nc; j++)
            for (i = 0; i < nr; i++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

or more simply :

    case LGLSXP :
        for (i = 0; i < nc*nr; i++)
            LOGICAL(ans)[i] = NA_LOGICAL;

(with some fine tuning required since NR is type R_xlen_t whilst i, nc and nr are type int). Same goes for all the other types in that switch. This came up on Stack Overflow here : http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r

That is completely irrelevant - modern compilers will optimize the loops accordingly and there is no difference in speed. If you don't believe it, run benchmarks ;)

original

    > microbenchmark(matrix(nrow=1, ncol=), times=10)
    Unit: milliseconds
                            expr      min       lq  median       uq      max
    1 matrix(nrow = 1, ncol = ) 940.5519 940.6644 941.136 954.7196 1409.901

swapped

    > microbenchmark(matrix(nrow=1, ncol=), times=10)
    Unit: milliseconds
                            expr      min       lq   median      uq      max
    1 matrix(nrow = 1, ncol = ) 949.9638 950.6642 952.7497 961.001 1246.573

Cheers, Simon

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
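The matrix sizes in the benchmark calls above are not visible in the archived text; a runnable equivalent, with dimensions chosen purely for illustration:

    library(microbenchmark)
    microbenchmark(matrix(nrow = 5000, ncol = 5000), times = 10)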
[Rd] Possible page inefficiency in do_matrix in array.c
In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
        for (i = 0; i < nr; i++)
            for (j = 0; j < nc; j++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
        for (j = 0; j < nc; j++)
            for (i = 0; i < nr; i++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

or more simply :

    case LGLSXP :
        for (i = 0; i < nc*nr; i++)
            LOGICAL(ans)[i] = NA_LOGICAL;

(with some fine tuning required since NR is type R_xlen_t whilst i, nc and nr are type int). Same goes for all the other types in that switch. This came up on Stack Overflow here : http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Non ascii character on Mac on CRAN (C locale)
Dear all,

A recent bug fix for data.table was for non-ascii characters in column names and grouping by those columns. So, the package's test file now includes non-ascii characters to test that bug fix :

    # Test non ascii characters when passed as character by, #2134
    x = rep(LETTERS[1:2], 3)
    y = rep(1:3, each=2)
    DT = data.table(ÅR=x, foo=y)
    test(708, names(DT[, mean(foo), by="ÅR"]), c("ÅR","V1"))
    test(709, DT[, mean(foo), by="ÅR"], DT[, mean(foo), by=ÅR])
    DT = data.table(FÅR=x, foo=y)
    test(710, names(DT[, mean(foo), by="FÅR"]), c("FÅR","V1"))
    DT = data.table(ÆØÅ=x, foo=y)
    test(711, DT[, mean(foo), by="ÆØÅ"], data.table(ÆØÅ=c("A","B"), V1=2))
    test(712, DT[, mean(foo), by=ÆØÅ], data.table(ÆØÅ=c("A","B"), V1=2))

This passes R CMD check on Linux, Windows and Mac on R-Forge, but not on Mac on CRAN, because Prof Ripley advises that it uses the C locale. It works on Windows because data.table does this first :

    oldenc = options(encoding="UTF-8")[[1L]]
    sys.source("tests.R")   # the file that includes the tests above
    options(encoding=oldenc)

If I change it to the following, will it work on CRAN's Mac, and is this ok/correct? Since it passes on R-Forge's Mac, I can't think how else to test this.

    oldlocale = Sys.getlocale("LC_CTYPE")
    if (oldlocale=="C") Sys.setlocale("LC_CTYPE","en_GB.UTF-8")
    oldenc = options(encoding="UTF-8")[[1L]]
    sys.source("tests.R")   # the file that includes the tests above
    options(encoding=oldenc)
    Sys.setlocale("LC_CTYPE",oldlocale)

Many thanks, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Understanding tracemem
Hadley Wickham hadley at rice.edu writes:

Why does x[5] <- 5 create a copy

That assigns 5, not 5L. x is being coerced from integer to double. x[5] <- 5L doesn't copy.

, when x[11] (which should be extending a vector) does not? I can understand that maybe x[5] <- 5 hasn't yet been optimised to not make a copy, but if that's the case then why doesn't x[11] <- 11 make one?

Extending a vector is creating a new (longer) vector and copying the old (shorter) one in. That's different to duplicate(). tracemem only reports calls to duplicate().

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
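As a runnable sketch of the behaviour described in the reply (tracemem requires R built with memory profiling, as the CRAN binaries are; the copy/no-copy pattern below restates the explanation above and exact output varies by R version):

    x <- 1:10      # integer vector
    tracemem(x)
    x[5] <- 6L     # integer into integer: no duplicate(), tracemem silent
    x[5] <- 5      # coerces integer to double: tracemem reports a copy
    x[11] <- 11    # extends the vector: a new allocation, not duplicate(),
                   # so tracemem is silent again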
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Matthew Dowle mdowle at mdowle.plus.com writes: Will check R-Forge again when it catches up. Thanks. Matthew Just to confirm, R-Forge has today caught up and is now using R r59554 which includes the fix for the problem in this thread. Its binary build of data.table is now installing fine on R 2.15.0 release, which it wasn't doing before. Many thanks, Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
On Tue, 2012-06-12 at 20:38 -0400, Simon Urbanek wrote:

Something like

    all: $(SHLIB)
    	mv $(SHLIB) datatable$(SHLIB_EXT)

should do the trick (resist the temptation to create a datatable$(SHLIB_EXT) target - it doesn't work due to the makefile loading sequence, unfortunately). AFAIR you don't need to mess with install.libs because the default is to install all shlibs in the directory.

Cheers, Simon

Huge thank you, Simon. Works perfectly. +100!

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
Matthew Dowle mdowle at mdowle.plus.com writes:

On Tue, 2012-06-12 at 20:38 -0400, Simon Urbanek wrote:

Something like

    all: $(SHLIB)
    	mv $(SHLIB) datatable$(SHLIB_EXT)

should do the trick (resist the temptation to create a datatable$(SHLIB_EXT) target - it doesn't work due to the makefile loading sequence, unfortunately). AFAIR you don't need to mess with install.libs because the default is to install all shlibs in the directory.

Cheers, Simon

Huge thank you, Simon. Works perfectly. +100!

Matthew

I guess the 'mv' command works on Mac, too. For Windows I think I need to create pkg/src/Makevars.win with 'mv' replaced by 'rename'. Is that right?

    all: $(SHLIB)
    	rename $(SHLIB) datatable$(SHLIB_EXT)

I could try that and submit to winbuilder and see, but asking here as well in case there's anything else to consider for Windows.

Thanks again, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
On 12-06-13 4:45 AM, Matthew Dowle wrote:

I guess the 'mv' command works on Mac, too. For Windows I think I need to create pkg/src/Makevars.win with 'mv' replaced by 'rename'. Is that right?

    all: $(SHLIB)
    	rename $(SHLIB) datatable$(SHLIB_EXT)

I could try that and submit to winbuilder and see, but asking here as well in case there's anything else to consider for Windows.

mv should be fine on Windows. If you have a makefile, you have Rtools installed, and mv is in Rtools.

Duncan Murdoch

Neat. Glad I asked, thanks.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] How to change name of .so/.dll
Hi,

I've added R_init_data_table to the data.table package (which has a dot in its name). This works well in R 2.15.0, because of this from the Writing R Extensions manual :

"Note that there are some implicit restrictions on this mechanism as the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain '.'): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. As from R 2.15.0, if entry point R_init_lib is not found it is also looked for with '.' replaced by '_'."

But how do I confine the DLL name, is it an option in Makevars? The name of the shared object is currently data.table.so (data.table.dll on Windows). Is it possible to change the file name to datatable.so (and datatable.dll) in a portable way so that R_init_datatable works (without a dot), and, without Depend-ing on R >= 2.15.0 and without changing the name of the package?

Thanks, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
Matthew Dowle wrote :

Hi, I've added R_init_data_table to the data.table package (which has a dot in its name). This works well in R 2.15.0, because of this from the Writing R Extensions manual :

"Note that there are some implicit restrictions on this mechanism as the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain '.'): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. As from R 2.15.0, if entry point R_init_lib is not found it is also looked for with '.' replaced by '_'."

But how do I confine the DLL name, is it an option in Makevars? The name of the shared object is currently data.table.so (data.table.dll on Windows). Is it possible to change the file name to datatable.so (and datatable.dll) in a portable way so that R_init_datatable works (without a dot), and, without Depend-ing on R >= 2.15.0 and without changing the name of the package?

Just to clarify, I'm aware R CMD SHLIB has the -o argument, which can be used to create datatable.so instead of data.table.so. It's R CMD INSTALL that's the problem, as that seems to pass -o pkg_name to R CMD SHLIB. I found install.libs.R (added to R in 2.13.1); could that be used to create datatable.so instead of data.table.so? Or a line I could add to pkg/src/Makevars?

Thanks! Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggest that as.double( something double ) not make a copy
Henrik Bengtsson hb at biostat.ucsf.edu writes: See also R-devel '[Rd] Suggestion for memory optimization and as.double() with friends', March 28-29 2007 [https://stat.ethz.ch/pipermail/r-devel/2007-March/045109.html]. /Henrik Interesting thread. So we have you to thank for instigating that 5 years ago: thanks! Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Prof Ripley wrote : That Depends line is about source installs. I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
On 07/06/2012 11:40, Matthew Dowle wrote:

Prof Ripley wrote : That Depends line is about source installs.

I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac.

That field is documented under the description of a *source* package (see the first line of section 1.1, and it is in that section) and is simply copied from the source package for binary installs. It is the extra line added to the DESCRIPTION file, e.g.

    Built: R 2.15.0; x86_64-pc-mingw32; 2012-04-02 09:27:07 UTC; windows

that tells you the version a binary package was built under (approximately for R-patched and R-devel), and library() checks.

I'm fairly sure I understand all that. I'm still missing something more basic, probably. Consider the following workflow : I look on CRAN at package boot. Its webpage states Depends: R (>= 2.14.0). I'm a user running R and I know I use 2.14.1, so I think great, I can use it. I install it as follows.

    > version
      version.string  R version 2.14.1 (2011-12-22)
    > install.packages("boot")
    trying URL 'http://cran.ma.imperial.ac.uk/bin/windows/contrib/2.14/boot_1.3-4.zip'
    Content type 'application/zip' length 469615 bytes (458 Kb)
    opened URL
    downloaded 458 Kb
    package 'boot' successfully unpacked and MD5 sums checked
    > require(boot)
    Loading required package: boot
    Warning message:
    package 'boot' was built under R version 2.14.2

Does this mean that CRAN maintainers expect me to run the latest version of the major release I'm using (R 2.14.2 in this case), not the current release of R (R 2.15.0 currently) as you wrote earlier? If that's the case I never realised it before, but that seems very reasonable. When I ran the above just now I expected it to say package 'boot' was built under R version 2.15.0. But it didn't, it said 2.14.2. So it seems to be my misunderstanding.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
On 07/06/2012 12:49, Matthew Dowle wrote:

On 07/06/2012 11:40, Matthew Dowle wrote:

Prof Ripley wrote : That Depends line is about source installs.

I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac.

That field is documented under the description of a *source* package (see the first line of section 1.1, and it is in that section) and is simply copied from the source package for binary installs. It is the extra line added to the DESCRIPTION file, e.g.

    Built: R 2.15.0; x86_64-pc-mingw32; 2012-04-02 09:27:07 UTC; windows

that tells you the version a binary package was built under (approximately for R-patched and R-devel), and library() checks.

I'm fairly sure I understand all that. I'm still missing something more basic, probably. Consider the following workflow : I look on CRAN at package boot. Its webpage states Depends: R (>= 2.14.0). I'm a user running R and I know I use 2.14.1, so I think great, I can use it. I install it as follows.

    > version
      version.string  R version 2.14.1 (2011-12-22)
    > install.packages("boot")
    trying URL 'http://cran.ma.imperial.ac.uk/bin/windows/contrib/2.14/boot_1.3-4.zip'
    Content type 'application/zip' length 469615 bytes (458 Kb)
    opened URL
    downloaded 458 Kb
    package 'boot' successfully unpacked and MD5 sums checked
    > require(boot)
    Loading required package: boot
    Warning message:
    package 'boot' was built under R version 2.14.2

Does this mean that CRAN maintainers expect me to run the latest version of the major release I'm using (R 2.14.2 in this case), not the current release of R (R 2.15.0 currently) as you wrote earlier? If that's the case I never realised it before, but that seems very reasonable. When I ran the above just now I expected it to say package 'boot' was built under R version 2.15.0. But it didn't, it said 2.14.2. So it seems to be my misunderstanding.

2.15.x and 2.14.x are different series, with different binary repos.

Thanks. So CRAN will continue to build and check new versions of packages using R 2.14.2 in the 2.14.x repo, whilst R 2.15.x progresses separately. I'm familiar with r-oldrel results on the CRAN package check results page, but for some reason I had missed the nuance that there's a binary repo too for r-oldrel. That's great.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
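The per-series binary repositories can be seen from R itself; for illustration (any CRAN mirror URL works the same way, and the commented results are what contrib.url composes from the running R version):

    > contrib.url("http://cran.r-project.org", type = "win.binary")
    # under R 2.14.x: "http://cran.r-project.org/bin/windows/contrib/2.14"
    # under R 2.15.x: "http://cran.r-project.org/bin/windows/contrib/2.15"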
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
I built R-trunk (rev 59537), ran 'Rtrunk CMD build data.table', installed the resulting tar.gz into R release and it ran tests ok. So it seems ok now, if that tested it right. Will check R-Forge again when it catches up. Thanks.

Matthew

On Wed, 2012-06-06 at 22:04 +0200, peter dalgaard wrote:

FYI, Brian has backed out the changes to identical() in r59533 of R-patched. Please retry your test codes with the new version. (Due to some ISP mess-up, Brian is temporarily unable to reply in detail himself.)

-pd

On Jun 6, 2012, at 20:29, luke-tier...@uiowa.edu wrote:

On Wed, 6 Jun 2012, Matthew Dowle wrote:

Dan Tenenbaum dtenenba at fhcrc.org writes:

I know this has come up before on R-help (http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html) but I have a concise reproducible case that I wanted to share. Also, please note the Bioconductor scenario which is potentially seriously impacted by this. The issue arises when a binary version of a package (like my example package below) is built under R 2.15.0 Patched but then installed under R 2.15.0. Our package AnnotationDbi (which hundreds of other packages depend on) is impacted by this issue to the extent that calling virtually any function in it will return something like this:

    Error in ls(2) : 7 arguments passed to .Internal(identical) which requires 6

My concern is that when R 2.15.1 is released and Bioconductor starts building all its packages under it, that R 2.15.0 users will start to experience this problem. We can ask all users to upgrade to R 2.15.1 if we have to, but it's not usually the case that a minor point release MUST be installed in order to run packages built under it (please correct me if I'm wrong). We would much prefer a workaround or fix to make an upgrade unnecessary.

I'm seeing the same issue. Installing the latest R-Forge .zip of data.table built using 2.15.0 patched, on R 2.15.0 (or 2.14.1, same issue), then running data.table(a=1:3) produces the "7 arguments passed to .Internal(identical) which requires 6" error. traceback() and debugger() just display the top level call. debug(data.table) and stepping through reveals it is a call to identical() but just a regular one. No .Internal() call in the package, let alone passing 6 or 7 arguments to .Internal. Not sure how else to debug or trace it. R-Forge is byte compiling data.table using R 2.15.0 patched (iiuc); would that make a difference when the byte code is loaded into 2.15.0 which doesn't have the new argument in identical()?

Matthew

Yes it would.

luke

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                Phone: 319-335-3386
Department of Statistics and      Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                email: luke-tier...@uiowa.edu
Iowa City, IA 52242               WWW: http://www.stat.uiowa.edu

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Dan Tenenbaum dtenenba at fhcrc.org writes:

I know this has come up before on R-help (http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html) but I have a concise reproducible case that I wanted to share. Also, please note the Bioconductor scenario which is potentially seriously impacted by this. The issue arises when a binary version of a package (like my example package below) is built under R 2.15.0 Patched but then installed under R 2.15.0. Our package AnnotationDbi (which hundreds of other packages depend on) is impacted by this issue to the extent that calling virtually any function in it will return something like this:

    Error in ls(2) : 7 arguments passed to .Internal(identical) which requires 6

My concern is that when R 2.15.1 is released and Bioconductor starts building all its packages under it, that R 2.15.0 users will start to experience this problem. We can ask all users to upgrade to R 2.15.1 if we have to, but it's not usually the case that a minor point release MUST be installed in order to run packages built under it (please correct me if I'm wrong). We would much prefer a workaround or fix to make an upgrade unnecessary.

I'm seeing the same issue. Installing the latest R-Forge .zip of data.table built using 2.15.0 patched, on R 2.15.0 (or 2.14.1, same issue), then running data.table(a=1:3) produces the "7 arguments passed to .Internal(identical) which requires 6" error. traceback() and debugger() just display the top level call. debug(data.table) and stepping through reveals it is a call to identical() but just a regular one. No .Internal() call in the package, let alone passing 6 or 7 arguments to .Internal. Not sure how else to debug or trace it. R-Forge is byte compiling data.table using R 2.15.0 patched (iiuc); would that make a difference when the byte code is loaded into 2.15.0 which doesn't have the new argument in identical()?

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggest that as.double( something double ) not make a copy
Tim Hesterberg timhesterberg at gmail.com writes:

I've been playing with passing arguments to .C(), and found that replacing as.double(x) with

    if(is.double(x)) x else as.double(x)

saves time and avoids one copy, in the case that x is already double. I suggest modifying as.double to avoid the extra copy and just return x, when x is already double. Similarly for as.integer, etc.

But as.double() already doesn't copy if its argument is already double. Unless your double has attributes? From coerce.c :

    if(TYPEOF(x) == type) {
        if(ATTRIB(x) == R_NilValue) return x;
        ans = NAMED(x) ? duplicate(x) : x;
        CLEAR_ATTRIB(ans);
        return ans;
    }

quick test :

    > x = 1
    > .Internal(inspect(x))
    @03E23620 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
    > .Internal(inspect(as.double(x)))   # no copy
    @03E23620 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
    > x = c(foo=1)   # give x some attributes, say names
    > x
    foo
      1
    > .Internal(inspect(x))
    @03E234D0 14 REALSXP g0c1 [NAM(1),ATT] (len=1, tl=0) 1
    ATTRIB:
      @03D54910 02 LISTSXP g0c0 []
        TAG: @00380088 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
        @03E234A0 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
          @03E23560 09 CHARSXP g0c1 [gp=0x21] "foo"
    > .Internal(inspect(as.double(x)))   # strips attribs, returning a new object
    @03E233B0 14 REALSXP g0c1 [] (len=1, tl=0) 1
    > as.double(x)
    [1] 1

Attribute stripping is documented in ?as.double.

Rather than as.double() on the R side, you could use coerceVector() on the C side, which might be easier to use via .Call than .C since it takes an SEXP.

Looking at coerceVector in coerce.c, its first line returns immediately if type is already the desired type, with no attribute stripping, so that seems like the way to go? If your double has no attributes then I'm barking up the wrong tree.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
Duncan Murdoch murdoch.duncan at gmail.com writes:

On 12-05-23 4:37 AM, Matthew Dowle wrote:

Hi, I've read ?is.unsorted and searched. Have found a few items but nothing close, yet. Is the following expected?

    > is.unsorted(data.frame(1:2))
    [1] FALSE
    > is.unsorted(data.frame(2:1))
    [1] FALSE
    > is.unsorted(data.frame(1:2,3:4))
    [1] TRUE
    > is.unsorted(data.frame(2:1,4:3))
    [1] TRUE

IIUC, is.unsorted is intended for atomic vectors only (description of x in ?is.unsorted). Indeed the C source (src/main/sort.c) contains an error message "only atomic vectors can be tested to be sorted". So that is the error message I expected to see in all cases above, since I know that data.frame is not an atomic vector. But there is also this in ?is.unsorted: "except for atomic vectors and objects with a class (where the >= or > method is used)", which I don't understand. Where is >= or > used, by what, and where?

If you look at the source, you will see that the basic test for classed objects is all(x[-1L] >= x[-length(x)]) (in the function base:::.gtn). This comparison doesn't really make sense for dataframes, but it does seem to be backwards: that tests that x[2] >= x[1], x[3] >= x[2], etc., returning TRUE if all comparisons are TRUE: but that sounds like it should be is.sorted(), not is.unsorted(). Or is it my brain that is backwards?

Thanks. Yes, you're right. So is.unsorted() on a data.frame is trying to tell us if there exists any unsorted row, it seems.

    > DF = data.frame(a=c(1,3,5), b=c(1,3,5))
    > DF
      a b
    1 1 1    # this row is sorted
    2 3 3    # this row is sorted
    3 5 5    # this row is sorted
    > is.unsorted(DF)    # going by row, but should be !.gtn
    [1] TRUE
    > with(DF, is.unsorted(order(a,b)))    # most people's natural expectation, I guess
    [1] FALSE
    > DF[2,2] = 2
    > DF
      a b
    1 1 1    # this row is sorted
    2 3 2    # this row isn't sorted
    3 5 5    # this row is sorted
    > is.unsorted(DF)    # going by row, but should be !.gtn
    [1] FALSE
    > with(DF, is.unsorted(order(a,b)))    # most people's natural expectation, I guess
    [1] FALSE

Since it seems to have a bug anyway (and if so, can't be correct in anyone's use of it), could is.unsorted on a data.frame either return the error that's in the C code already, "only atomic vectors can be tested to be sorted", for safety and to lessen confusion, or be changed to return the natural expectation proposed above? The easiest quick fix would be to negate the result of the .gtn call of course, but then you could never go back.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
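The column-wise behaviour called the "natural expectation" above generalises to any number of columns in a few lines of R; a sketch (a hypothetical helper, not base R, and relying on order() being stable):

    is_unsorted_df <- function(DF) {
        # TRUE unless the rows are already ordered by column 1,
        # with ties broken by column 2, and so on
        is.unsorted(do.call(order, unname(as.list(DF))))
    }
    is_unsorted_df(data.frame(a=c(1,3,5), b=c(1,3,5)))   # FALSE
    is_unsorted_df(data.frame(a=c(3,1,5), b=c(1,3,5)))   # TRUE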
Re: [Rd] Expected behaviour of is.unsorted?
Duncan Murdoch murdoch.duncan at gmail.com writes:

On 12-05-24 7:39 AM, Matthew Dowle wrote:

Since it seems to have a bug anyway (and if so, can't be correct in anyone's use of it), could is.unsorted on a data.frame either return the error that's in the C code already, "only atomic vectors can be tested to be sorted", for safety and to lessen confusion, or be changed to return the natural expectation proposed above? The easiest quick fix would be to negate the result of the .gtn call of course, but then you could never go back.

I don't follow the last sentence. If the .gtn call needs to be negated, why would you want to go back?

Because then is.unsorted(DF) would work, but go by row, which you guessed above wasn't intended and isn't sensible. But once it worked in that way, users might start to depend on it; e.g., by writing is.unsorted(t(DF)). If I came along in future and suggested that was inefficient, and wouldn't it be more natural and efficient if is.unsorted(DF) went by column, returning the same as with(DF, is.unsorted(order(a,b))) but implemented efficiently, you would fear that user code now depended on it going by row and say it was too late. I'd persist and highlight that it didn't seem in keeping with the spirit of is.unsorted()'s speed, since it short circuits on the first unsorted item, which is why we love it. You'd reply that's not documented. Which it isn't. And that would be the end of that.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
On 24/05/2012 9:15 AM, Matthew Dowle wrote:

Because then is.unsorted(DF) would work, but go by row, which you guessed above wasn't intended and isn't sensible. But once it worked in that way, users might start to depend on it; e.g., by writing is.unsorted(t(DF)). If I came along in future and suggested that was inefficient, and wouldn't it be more natural and efficient if is.unsorted(DF) went by column, returning the same as with(DF, is.unsorted(order(a,b))) but implemented efficiently, you would fear that user code now depended on it going by row and say it was too late. I'd persist and highlight that it didn't seem in keeping with the spirit of is.unsorted()'s speed, since it short circuits on the first unsorted item, which is why we love it. You'd reply that's not documented. Which it isn't. And that would be the end of that.

Okay, I'm going to fix the handling of .gtn results, and document the unsuitability of this function for dataframes and arrays.

But that leaves the door open to confusion later, whilst closing the door to a better solution: making is.unsorted() work by column for data.frame; i.e., making is.unsorted _suitable_ for data.frame. If you just do the quick fix for the .gtn result you can never go back. If making is.unsorted(DF) work by column is too hard for now, then leaving the door open would be better, by returning the error message already in the C code: "only atomic vectors can be tested to be sorted". That would be a better quick fix since it leaves options for the future.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
On 24/05/2012 11:10 AM, Matthew Dowle wrote:

Okay, I'm going to fix the handling of .gtn results, and document the unsuitability of this function for dataframes and arrays.

But that leaves the door open to confusion later, whilst closing the door to a better solution: making is.unsorted() work by column for data.frame; i.e., making is.unsorted _suitable_ for data.frame. If you just do the quick fix for the .gtn result you can never go back. If making is.unsorted(DF) work by column is too hard for now, then leaving the door open would be better, by returning the error message already in the C code: "only atomic vectors can be tested to be sorted". That would be a better quick fix since it leaves options for the future.

I don't see why saying this function is unsuitable for dataframes implies that it will never be made suitable for dataframes.

If user code or packages start to depend on is.unsorted(t(DF)), it would be harder to change, no? Why provide something that is unsuitable and allow that possibility to happen? It's more user friendly to return "not implemented" or "unsuitable", or the nicer message already in the C code, than leave the door open for confusion and errors. Or in other words, it's even more user friendly to return a warning or error to the user at the prompt, than the user friendliness of writing in the help file that it's unsuitable for data.frame.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Expected behaviour of is.unsorted?
Hi,

I've read ?is.unsorted and searched. Have found a few items but nothing close, yet. Is the following expected?

    > is.unsorted(data.frame(1:2))
    [1] FALSE
    > is.unsorted(data.frame(2:1))
    [1] FALSE
    > is.unsorted(data.frame(1:2,3:4))
    [1] TRUE
    > is.unsorted(data.frame(2:1,4:3))
    [1] TRUE

IIUC, is.unsorted is intended for atomic vectors only (description of x in ?is.unsorted). Indeed the C source (src/main/sort.c) contains an error message "only atomic vectors can be tested to be sorted". So that is the error message I expected to see in all cases above, since I know that data.frame is not an atomic vector. But there is also this in ?is.unsorted: "except for atomic vectors and objects with a class (where the >= or > method is used)", which I don't understand. Where is >= or > used, by what, and where?

I understand why the first two are FALSE (1 item of anything must be sorted). I don't understand the 3rd and 4th cases where length is 2: do_isunsorted seems to call lang3(install(".gtn"), x, CADR(args)). Does that fall back to TRUE for some reason?

Matthew

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-pc-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
    [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United Kingdom.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] data.table_1.8.0

    loaded via a namespace (and not attached):
    [1] tools_2.15.0

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] test suites for packages
Uwe Ligges ligges at statistik.tu-dortmund.de writes: On 17.05.2012 16:52, Brian G. Peterson wrote: On Thu, 2012-05-17 at 16:32 +0200, Uwe Ligges wrote: Yes: R CMD check does the trick. See Writing R Extension and read about a package's test directory. I prefer frameworks that do not obfuscate failing test results on the CRAN check farm (as most other frameworks I have seen). Uwe: I don't think that's completely fair. RUnit and testthat tests can be configured to be called from the R package tests directory, so that they are run during R CMD check. They don't *need* to be configured that way, so perhaps that's what you're talking about. I am talking about the problem that relevant output of test failures that may help to identify the problem is frequently not shown in the output of R CMD check when such frameworks are used - that is a major nuisance for CRAN automatisms. Not sure, but could it be that in some cases the output of test failures is there, but chopped off since CRAN displays the 13 line tail? At least that's what I've experienced, and reported, and asked to be increased in the past. Often the first error causes a cascade, so it's the head you need to see, not the tail. If I've got that right, how about a much larger limit than 13, say 1000. Or the first 50 and last 50 lines of output. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
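One pragmatic pattern for a package's tests/ script, so that the informative part survives a truncated tail: collect failures and report them together at the end. A sketch only; check() here is a hypothetical helper, not part of any existing test framework:

    failures <- character()
    check <- function(label, expr) {
        # record the label if the expression errors or is not TRUE
        ok <- isTRUE(tryCatch(expr, error = function(e) FALSE))
        if (!ok) failures <<- c(failures, label)
        invisible(ok)
    }
    check("arithmetic", identical(1 + 1, 2))
    check("paste",      identical(paste("a", "b"), "a b"))
    if (length(failures))
        stop("tests failed: ", paste(failures, collapse = ", "))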
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Antonio Piccolboni antonio at piccolboni.info writes:

Hi, I was wondering if there is anything more efficient than split to do the kind of conversion in the subject. If I create a data frame as in

    > system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x", 1:2000, sep=""))})
       user  system elapsed
      0.004   0.000   0.004

and then I try to split it

    > system.time(split(fd, 1:nrow(fd)))
       user  system elapsed
      0.333   0.031   0.415

You will be quick to notice the roughly two orders of magnitude difference in time between creation and conversion. Granted, it's not written anywhere that they should be similar, but the latter seems interpreter-slow to me (split is implemented with a lapply in the data frame case). There is also a memory issue when I hit about 2 elements (allocating 3GB when interrupted). So before I resort to Rcpp, despite the electrifying feeling of approaching the bare metal and for the sake of getting things done, I thought I would ask the experts.

Thanks

Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, before r-devel. If you did, please say so. Answering anyway.

Do you really want to split every single row? What's the bigger picture? Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying some (biased) guesswork, have you seen the data.table package? It doesn't use the split-apply-combine paradigm because, as your (extreme) example shows, that doesn't scale. When you use the 'by' argument of [.data.table, it allocates memory once for the largest group. Then it reuses that same memory for each group. That's one reason it's fast and memory efficient at grouping (an order of magnitude faster than tapply). Independent timings : http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then DT[, something, by=1:nrow(DT)] will give perhaps two orders of magnitude speedup, but that's an unfair example because it isn't very realistic. Scaling applies to the size of the data.frame, and how much you want to split it up. Your example is extreme in the latter but not the former. data.table scales in both.

It's nothing to do with the interpreter, btw, just memory usage.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
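For concreteness, the grouping idiom referred to above, on toy data (sizes and column names chosen purely for illustration; by="id" syntax as in data.table 1.8.x):

    library(data.table)
    DT <- data.table(x = 1:2000, y = rnorm(2000), id = rep(1:100, each = 20))
    DT[, mean(y), by = "id"]   # grouped aggregation, no physical split()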
Re: [Rd] Byte compilation of packages on CRAN
On 11/04/2012 20:36, Matthew Dowle wrote:

In DESCRIPTION if I set LazyLoad to 'yes' will data.table (for example) then be byte compiled for users who install the binary package from CRAN on Windows?

No. LazyLoad is distinct from byte compilation. All installed packages use lazy loading these days (for simplicity: a very few do not benefit from it as they use all their objects at startup).

This question is based on reading section 1.2 of this document : http://www.divms.uiowa.edu/~luke/R/compiler/compiler.pdf I've searched r-devel and Stack Overflow history and have found questions and answers relating to R CMD INSTALL and install.packages() from source, but no answer (as yet) about why binary packages for Windows appear not to be byte compiled. If so, is there any reason why all packages should not set LazyLoad to 'yes'. And if not, could LazyLoad be 'yes' by default?

I wonder why you are not reading R's own documentation. 'Writing R Extensions' says:

"The `LazyData' logical field controls whether the R datasets use lazy-loading. A `LazyLoad' field was used in versions prior to 2.14.0, but now is ignored. The `ByteCompile' logical field controls if the package code is byte-compiled on installation: the default is currently not to, so this may be useful for a package known to benefit particularly from byte-compilation (which can take quite a long time and increases the installed size of the package)."

Oops, somehow missed that. Thank you!

Note that the majority of CRAN packages benefit very little from byte-compilation because almost all the time of their computations is spent in compiled code. And the increased size also may matter when the code is loaded into R.

Thanks, Matthew

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,     Tel: +44 1865 272861 (self)
1 South Parks Road,            +44 1865 272866 (PA)
Oxford OX1 3TG, UK        Fax: +44 1865 272595

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Byte compilation of packages on CRAN
In DESCRIPTION if I set LazyLoad to 'yes' will data.table (for example) then be byte compiled for users who install the binary package from CRAN on Windows? This question is based on reading section 1.2 of this document : http://www.divms.uiowa.edu/~luke/R/compiler/compiler.pdf I've searched r-devel and Stack Overflow history and have found questions and answers relating to R CMD INSTALL and install.packages() from source, but no answer (as yet) about why binary packages for Windows appear not to be byte compiled. If so, is there any reason why all packages should not set LazyLoad to 'yes'. And if not, could LazyLoad be 'yes' by default? Thanks, Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] CRAN policies
Mark.Bravington at csiro.au writes:

There must be over 2000 people who have written CRAN packages by now; every extra check and non-back-compatible additional requirement runs the risk of generating false negatives and incurring many extra person-hours to fix non-problems. Plus someone needs to document and explain the check (adding to the rule mountain), plus there is the time spent in discussions like this..!

Not sure where you're coming from on that. For example, Prof Ripley has added quite a few new NOTEs to QC.R over the last few months. These caught things I wasn't aware of in the two packages I maintain, and I was more than happy to fix them. It improves quality, surely. There's only one particular NOTE causing an issue: 'no visible binding'. If it were made a MEMO, we could move on. All the other NOTEs can (and should) be fixed, can't they?

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] CRAN policies
William Dunlap wdunlap at tibco.com writes:

-Original Message-

The survival package has a similar special case: the routines for expected population survival are set up to accept multiple types of date format, so have lines like

    if (class(x) == 'chron') { y <- as.numeric(x - chron("01/01/1960")) }

This leaves me with two extraneous "no visible binding" messages.

Suppose we defined a function like

    NO_VISIBLE_BINDING <- function(expr) expr

and added an entry to the stuff in codetools so that it would not check for misspelled object names in calls to NO_VISIBLE_BINDING. Then Terry could write that line as

    if (class(x) == "chron") { y <- as.numeric(x - NO_VISIBLE_BINDING(chron)("01/01/1960")) }

and the Notes would disappear.

That's ok for package code, but what about test suites? Say there was a test on the result of with(DF, a+b); you wouldn't want to change the test to with(DF, NO_VISIBLE_BINDING(a) + NO_VISIBLE_BINDING(b)), not just because that's long and onerous, but because that's *changing* the test, i.e. introducing a difference between what's tested and what user code will do.

As others suggested, how about a new category: MEMO. The "no visible binding" NOTE would be downgraded to MEMO. CRAN maintainers could then ignore MEMOs more easily.

What I really like about NOTES is that when new checks are added to R, then as a package maintainer you know you don't have to fix them straight away. If a new WARNING shows up on r-devel daily checks, however, then you've got some warning about the WARNING that you need to fix more urgently, and it may even accelerate a release. So it's not just about checks when submitting a package, but what happens afterwards as R itself (and packages in Depends) move on. In other words, you know you need to fix new NOTES but not as urgently as new WARNINGS.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
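The "no visible binding" NOTE comes from the codetools package, so it can be reproduced interactively without running R CMD check; a sketch (the chron package is deliberately not attached here, so the name cannot be resolved):

    library(codetools)
    f <- function(x)
        if (class(x) == "chron") as.numeric(x - chron("01/01/1960"))
    checkUsage(f)
    # f: no visible global function definition for 'chron'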
Re: [Rd] merge bug fix in R 2.15.0
Anyone? Is it intended that the first suffix can no longer be blank? Seems to be caused by a bug fix to merge in R 2.15.0.

    $ Rdevel --vanilla
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
    Error in merge.data.frame(DF1, DF2, by = "a", suffixes = c("", ".1")) :
      there is already a column named 'b'

    $ R --vanilla
    R version 2.14.2 (2012-02-29)
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
      a b b.1
    1 1 4   7
    2 2 5   8
    3 3 6   9

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
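Until or unless that is changed, one workaround is a throwaway non-blank first suffix that is stripped afterwards; a sketch (".x" is an arbitrary choice, not part of merge's API):

    m <- merge(DF1, DF2, by = "a", suffixes = c(".x", ".1"))
    names(m) <- sub("\\.x$", "", names(m))   # drop the temporary suffix
    m
    #   a b b.1
    # 1 1 4   7
    # 2 2 5   8
    # 3 3 6   9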
[Rd] merge bug fix in R 2.15.0
Is it intended that the first suffix can no longer be blank? Seems to be caused by a bug fix to merge in R 2.15.0.

    $ Rdevel --vanilla
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
    Error in merge.data.frame(DF1, DF2, by = "a", suffixes = c("", ".1")) :
      there is already a column named 'b'

    $ R --vanilla
    R version 2.14.2 (2012-02-29)
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
      a b b.1
    1 1 4   7
    2 2 5   8
    3 3 6   9

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] 111 FIXMEs in main/src
Hi,

We sometimes see offers to contribute, asking what needs to be done. If they know C, how about the 111 FIXMEs? But which ones would be most useful to fix? Which are difficult and which are easy? Does R-core have a process to list and prioritise the FIXMEs?

~/R/Rtrunk/src/main$ grep "[^/]FIXME" * | wc -l
111
~/R/Rtrunk/src/main$ grep -A 1 "[^/]FIXME" *
arithmetic.c:/* FIXME: consider using
arithmetic.c-    tmp = (long double)x1 - floor(q) * (long double)x2;
--
arithmetic.c:/* FIXME: with the y == 2.0 test now at the top that case isn't
arithmetic.c-   reached here, but i have left it for someone who understands the
--
arithmetic.c:/* FIXME: Danger Will Robinson.
arithmetic.c- * - We might be trashing arguments here.
--
array.c:/* FIXME: the following is desirable, but pointless as long as
array.c-   subset.c & others have a contrary version that leaves the
--
attrib.c:/* FIXME: 1.e-5 should rather be == option('ts.eps') !! */
attrib.c-    if (fabs(end - start - (n - 1)/frequency) > 1.e-5)
--
attrib.c:    /* FIXME : The whole classgets may as well die. */
attrib.c-
--
attrib.c:/* FIXME */
attrib.c-    if (nvalues <= 0)
--
attrib.c:/* FIXME */
attrib.c-    PROTECT(namesattr);
--
attrib.c:/* FIXME: the code below treats pair-based structures */
attrib.c-/* in a special way. This can probably be dropped down */
--
base.c:/* FIXME: Make this a macro to avoid function call overhead?
base.c-   Inline it if you really think it matters.
--
bind.c:/* FIXME : is there another possibility? */
bind.c-
--
bind.c:    /* FIXME: I'm not sure what the author intended when the sequence was
bind.c-       defined as raw < logical -- it is possible to represent logical as
--
builtin.c:    /* FIXME -- Rstrlen allows for double-width chars */
builtin.c-    width += Rstrlen(STRING_ELT(labs, nlines % lablen), 0) + 1;
--
builtin.c:/* FIXME: call EncodeElement() for every element of s.
builtin.c-
--
builtin.c:    /* FIXME : cat(...) should handle ANYTHING */
builtin.c-    w = strlen(p);
--
character.c:    slen = strlen(ss); /* FIXME -- should handle embedded nuls */
character.c-    buf = R_AllocStringBuffer(slen+1, cbuff);
--
character.c:   FIXME: could prefer UTF-8 here
character.c- */
--
character.c:/* FIXME: could use R_Realloc instead */
character.c-    cbuf = CallocCharBuf(strlen(tmp) + 1);
--
character.c:/* FIXME use this buffer for new string as well */
character.c-    wc = (wchar_t *)
--
coerce.c:/* FIXME: Use
coerce.c- =
--
complex.c:/* FIXME: maybe add full IEC60559 support */
complex.c-static double complex clog(double complex x)
--
complex.c:/* FIXME: check/add full IEC60559 support */
complex.c-static double complex cexp(double complex x)
--
connections.c:/* FIXME: is this correct for consoles? */
connections.c-    checkArity(op, args);
--
connections.c:/* FIXME: could do any MBCS locale, but would need pushback */
connections.c-static SEXP
--
connections.c:    outlen = 1.01 * inlen + 600; /* FIXME, copied from bzip2 */
connections.c-    buf = (unsigned char *) R_alloc(outlen, sizeof(unsigned char));
--
datetime.c:    /* FIXME some of this should be outside the loop */
datetime.c-    int ns, nused = 4;
--
dcf.c:    /* FIXME:
dcf.c-       Why are we doing this?
--
debug.c:/* FIXME: previous will have 0x whereas other values are
debug.c-   without the */
--
deriv.c:/* FIXME: simplify exp(lgamma( E )) = gamma( E ) */
deriv.c-    ans = lang2(ExpSymbol, arg1);
--
deriv.c:/* FIXME: simplify log(gamma( E )) = lgamma( E ) */
deriv.c-    ans = lang2(LogSymbol, arg1);
--
deriv.c:/* FIXME */
deriv.c-#ifdef NOTYET
--
devices.c:/* FIXME Disable this for now */
devices.c-/*
--
devices.c:/* FIXME: There should really be a formal graphics finaliser
devices.c- * but this is a good proxy for now.
--
devices.c:/* FIXME: there should be a way for a device to declare its own
devices.c-   events, and return information on how to set them */
--
dounzip.c:       filename is in UTF-8, so FIXME */
dounzip.c-    SET_STRING_ELT(names, i, mkChar(filename_inzip));
--
duplicate.c:   FIXME: surely memcpy would be faster here?
duplicate.c-*/
--
engine.c:/* FIXME: what about clipping? (if the device can't)
engine.c-*/
--
engine.c:/* FIXME: what about clipping? (if the device can't)
engine.c- * Maybe not too bad because it is just a matter of shaving off
--
engine.c:    /* FIXME: This assumes that wchar_t is UCS-2/4,
engine.c-       since that is what GEMetricInfo expects */
--
engine.c:/* FIXME: should we warn on more than one character here? */
engine.c-int GEstring_to_pch(SEXP pch)
--
envir.c:   FIXME ? should this also
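For anyone wanting to reproduce the count without leaving R, here is a rough sketch; it assumes (as above) that the svn trunk is checked out at ~/R/Rtrunk, and it only scans the .c files, so the count may differ slightly from the shell grep over *:

    # Count FIXMEs across src/main from within R
    files <- list.files("~/R/Rtrunk/src/main", pattern = "\\.c$", full.names = TRUE)
    hits <- lapply(files, function(f)
        grep("[^/]FIXME", readLines(f, warn = FALSE), value = TRUE))
    names(hits) <- basename(files)
    sum(sapply(hits, length))   # ~111 at the time of writing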
[Rd] Identical copy of base function
Hello,

Regarding this in R-devel/NEWS/New features :

  o library(pkg) no longer warns about a conflict with a function from
    package:base if the function is an identical copy of the base one
    but with a different environment.

Why would one want an identical copy in a different environment? I'm thinking I may be missing out on a trick here.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
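For concreteness, a minimal sketch of the situation the NEWS item describes, using base::rev purely as a stand-in (the ignore.environment argument to identical() needs a reasonably recent R):

    # A function byte-for-byte identical to the base one, except for its
    # enclosing environment:
    f <- base::rev
    environment(f) <- new.env(parent = baseenv())
    identical(f, base::rev)                              # FALSE: environments differ
    identical(f, base::rev, ignore.environment = TRUE)   # TRUE: the copy library() now tolerates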
[Rd] names<- appears to copy 3 times?
Hi,

$ R --vanilla
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
DF = data.frame(a=1:3, b=4:6)
DF
  a b
1 1 4
2 2 5
3 3 6
tracemem(DF)
[1] "<0x8898098>"
names(DF)[2] = "B"
tracemem[0x8898098 -> 0x8763e18]:
tracemem[0x8763e18 -> 0x8766be8]:
tracemem[0x8766be8 -> 0x8766b68]:
DF
  a B
1 1 4
2 2 5
3 3 6

Are those 3 copies really taking place?

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
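For context, R-lang 3.4.4 defines this replacement as expanding through a temporary, which is one place the extra references (and hence copies) can come from; a sketch of the expansion:

    # names(DF)[2] <- "B" is defined to expand to roughly:
    `*tmp*` <- DF
    DF <- `names<-`(`*tmp*`, value = `[<-`(names(`*tmp*`), 2, value = "B"))
    rm(`*tmp*`)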
[Rd] Confused about NAMED
Hi,

I expected NAMED to be 1 in all these three cases. It is for one of them, but not the other two?

R --vanilla
R version 2.14.0 (2011-10-31)
Platform: i386-pc-mingw32/i386 (32-bit)

x = 1L
.Internal(inspect(x))   # why NAM(2)? expected NAM(1)
@2514aa0 13 INTSXP g0c1 [NAM(2)] (len=1, tl=0) 1

y = 1:10
.Internal(inspect(y))   # NAM(1) as expected but why different to x?
@272f788 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,2,3,4,5,...

z = data.frame()
.Internal(inspect(z))   # why NAM(2)? expected NAM(1)
@24fc28c 19 VECSXP g0c0 [OBJ,NAM(2),ATT] (len=0, tl=0)
ATTRIB:
  @24fc270 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @24fc334 16 STRSXP g0c0 [] (len=0, tl=0)
    TAG: @3f2040 01 SYMSXP g0c0 [MARK,gp=0x4000] "row.names"
    @24fc318 13 INTSXP g0c0 [] (len=0, tl=0)
    TAG: @3f2388 01 SYMSXP g0c0 [MARK,gp=0x4000] "class"
    @25be500 16 STRSXP g0c1 [] (len=1, tl=0)
      @1d38af0 09 CHARSXP g0c2 [MARK,gp=0x21,ATT] "data.frame"

It's a little difficult to search for the word 'named' but I tried and found this in R-ints :

  "Note that optimizing NAMED = 1 is only effective within a primitive (as the closure wrapper of a .Internal will set NAMED = 2 when the promise to the argument is evaluated)"

So might it be that just looking at NAMED using .Internal(inspect()) is setting NAMED=2? But if so, why does y have NAMED==1?

Thanks!
Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 11:13 , Matthew Dowle wrote:

[ snip ]

This is tricky business... I'm not quite sure I'll get it right, but let's try.

When you are assigning a constant, the value you assign is already part of the assignment expression, so if you want to modify it, you must duplicate. So NAMED==2 on z <- 1 is basically to prevent you from accidentally changing the value of 1. If it weren't, then you could get bitten by code like

    for(i in 1:2) {z <- 1; if(i==1) z[1] <- 2}

If you're assigning the result of a computation, then the object only exists once, so z <- 0+1 gets NAMED==1. However, if the computation is done by returning a named value from within a function, as in

    f <- function(){v <- 1+0; v}
    z <- f()

then again NAMED==2. This is because the side effects of the function _might_ result in something having a hold on the function environment, e.g. if we had

    e <- NULL
    f <- function(){e <<- environment(); v <- 1+0; v}
    z <- f()

then z[1] <- 5 would change e$v too. As it happens, there aren't any side effects in the former case, but R loses track and assumes the worst.

Thanks a lot, think I follow. That explains x vs y, but why is z NAMED==2? The result of data.frame() is an object that exists once (similar to 1:10) so shouldn't it be NAMED==1 too? Or, R loses track and assumes the worst even on its own functions such as data.frame()?

Thanks!
Matthew

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 12:34 , Matthew Dowle wrote:

[ snip ]

Or, R loses track and assumes the worst even on its own functions such as data.frame()?

R loses track. I suspect that is really all it can do without actual reference counting. The function data.frame is more than 150 lines of code, and if any of those end up invoking user code, possibly via a class method, you can't tell definitively whether or not the evaluation environment dies at the return.

Ohhh, think I see now. After Duncan's reply I was going to ask if it was possible to change data.frame() to be primitive so it could set NAMED=1. But it seems primitive functions can't use R code, so data.frame() would need to be ported to C. Ok! - not quick or easy, and not without considerable risk. And, data.frame() can invoke user code inside it anyway then.

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

DF = list(a=1:3, b=4:6)
.Internal(inspect(DF))   # so far so good: NAM(1)
@25149e0 19 VECSXP g0c1 [NAM(1),ATT] (len=2, tl=0)
  @263ea50 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @263eaa0 13 INTSXP g0c2 [] (len=3, tl=0) 4,5,6
ATTRIB:
  @2457984 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @25149c0 16 STRSXP g0c1 [] (len=2, tl=0)
      @1e987d8 09 CHARSXP g0c1 [MARK,gp=0x21] "a"
      @1e56948 09 CHARSXP g0c1 [MARK,gp=0x21] "b"

attr(DF,"foo") <- "bar"   # just adding an attribute sets NAM(2) ?
.Internal(inspect(DF))
@25149e0 19 VECSXP g0c1 [NAM(2),ATT] (len=2, tl=0)
  @263ea50 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @263eaa0 13 INTSXP g0c2 [] (len=3, tl=0) 4,5,6
ATTRIB:
  @2457984 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @25149c0 16 STRSXP g0c1 [] (len=2, tl=0)
      @1e987d8 09 CHARSXP g0c1 [MARK,gp=0x21] "a"
      @1e56948 09 CHARSXP g0c1 [MARK,gp=0x21] "b"
    TAG: @245732c 01 SYMSXP g0c0 [] "foo"
    @25148a0 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @2514920 09 CHARSXP g0c1 [gp=0x20] "bar"

Matthew

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 14:05 , Matthew Dowle wrote:

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

Yes. As soon as there is the slightest risk of having (had) two references to the same object, NAMED==2 and it is never reduced. While your mind is boggling, I might boggle it a bit more:

z <- 1:10
.Internal(inspect(z))
@116e11788 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,2,3,4,5,...
m <- mean(z)
.Internal(inspect(z))
@116e11788 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...

This happens because while mean() is running, there is a second reference to z, namely mean's argument x. (With languages like R, you have no insurance that there will be no changes to the global environment while a function call is being evaluated, so bugs can bite in both places -- z or x.) There are many of these cases where you might pragmatically want to override the default NAMED logic, but you'd be stepping into treacherous waters. Luke has probably been giving these matters quite some thought in connection with his compiler project.

Ok, very interesting. Think I'm there. Thanks for all the info.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 8:05 AM, Matthew Dowle wrote:

[ snip ]

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

Yes, because attr(x,"y") <- z is the same as

`*tmp*` <- x
x <- `attr<-`(`*tmp*`, "y", z)
rm(`*tmp*`)

so there are two references to the data frame: one in DF and one in `*tmp*`. It is the first line that causes the NAMED bump. And, yes, it's real:

`f<-` = function(x,value) { print(ls(parent.frame())); x <- value }
x = 1
f(x) = 1
[1] "*tmp*" "f<-"   "x"

You could skip that by using the function directly (I don't think it's recommended, though):

.Internal(inspect(l <- list(a=1)))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
ATTRIB:
  @100b6e748 02 LISTSXP g0c0 []
    TAG: @100843878 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @1028c82c8 16 STRSXP g0c1 [] (len=1, tl=0)
      @1009cd388 09 CHARSXP g0c1 [MARK,gp=0x21] "a"

.Internal(inspect(`names<-`(l, "b")))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
ATTRIB:
  @100b6e748 02 LISTSXP g0c0 []
    TAG: @100843878 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @1028c8178 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @100967af8 09 CHARSXP g0c1 [MARK,gp=0x20] "b"

.Internal(inspect(l))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
Re: [Rd] Efficiency of factor objects
Stavros Macrakis macrakis at alum.mit.edu writes:

data.table certainly has some useful mechanisms, and I've been experimenting with it as an implementation mechanism, though it's not a drop-in substitute for factors. Also, though it is efficient for set operations between small sets and large sets, it is not very efficient for operations between two large sets

As a general statement that could do with some clarification ;)

data.table likes keys consisting of multiple ordered columns, e.g. (id, date). It is (I believe) efficient for joining two large 2+ column keyed data sets, because the upper bound of each row's one-sided binary search is localised in that case (by group of the previous key column).

As I understand it, Stavros has a different type of 'two large datasets' : English language website data. Each set is one large vector of uniformly distributed unique strings. That appears to be quite a different problem to multiple columns of many-times-duplicated data.

Matthew

Thanks everyone, and if you do come across a relevant CRAN package, I'd be very interested in hearing about it. -s
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
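To make the multi-column keyed case concrete, a small sketch of the kind of join meant above; the table names and data are invented for illustration:

    library(data.table)
    # Two tables keyed on (id, date); the join binary-searches per row of DT2,
    # with the search range localised within each id group.
    DT1 <- data.table(id = rep(1:3, each = 2), date = rep(1:2, 3), x = 1:6,
                      key = "id,date")
    DT2 <- data.table(id = c(2L, 3L), date = c(1L, 2L), key = "id,date")
    DT1[DT2]   # rows of DT1 matching each (id, date) pair in DT2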
Re: [Rd] Contributors on R-Forge
Milan Bouchet-Valat nalimi...@club.fr wrote in message news:1319202026.9174.6.camel@milan...

On Friday 21 October 2011 at 13:39 +0100, Charles Roosen wrote :

Hi, I've recently taken over maintenance for the xtable package, and have set it up on R-Forge. At the moment I'm pondering what the best way is to handle submitted patches. Basically, is it better to:

1) Be non-restrictive regarding committer status, let individuals change the code with minimal pre-commit review, and figure changes can be reviewed before release.
2) Accept patches and basically log them as issues to look at in detail before putting them in.

I'd say you'd better review patches before they go in, as it would be quite ugly to fix things afterwards, right before the release. If a patch is buggy, better catch problems early instead of waiting for changes to add up: then, it will be harder to find out the origin of the bug. It also allows you to spot small issues like styling and indentation, that you wouldn't bother to fix once they've been committed. You can give people committer status, but ask them to post their patches as issues before committing. This reduces the burden imposed on the reviewer/maintainer.

My view :

1) Yes, be non-restrictive but impose some ground rules :
   i) each commit should pass 'R CMD check'
  ii) each new feature or bug fix should have an associated test added to the test suite (run by R CMD check), and an item added to NEWS (by the committer).
 iii) all developers subscribe to the -commits list and review each commit in a timely manner when the unified diff arrives in your inbox. If something is wrong or forgotten, ask the committer to fix it there and then.

Matthew

Regards
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Possible to read R_StringHash from a package?
Is there any way to look at R_StringHash from a package? I've read R-ints 1.16.1 'Hiding C entry points' and seen that R_StringHash is declared as extern0 in Defn.h. So it seems the answer is no.

Thanks,
Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4e259600.5070...@gmail.com...

On 11-07-19 7:48 AM, Matthew Dowle wrote:

Prof Brian Ripley rip...@stats.ox.ac.uk wrote in message news:alpine.lfd.2.02.1107190640280.28...@gannet.stats.ox.ac.uk...

On Mon, 18 Jul 2011, Alireza Mahani wrote:

Simon, Thank you for elaborating on the limitations of R in handling float types. I think I'm pretty much there with you. As for the insufficiency of single-precision math (and hence limitations of GPU), my personal take so far has been that double precision becomes crucial when some sort of error accumulation occurs. For example, in differential equations where boundary values are integrated to arrive at interior values, etc. On the other hand, in my personal line of work (Hierarchical Bayesian models for quantitative marketing), we have so much inherent uncertainty and noise at so many levels in the problem (and no significant error-accumulation sources) that the single vs double precision issue is often inconsequential for us. So I think it really depends on the field as well as the nature of the problem.

The main reason to use only double precision in R was that on modern CPUs double precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more-or-less for free.

But isn't it rather less 'free' when large data sets are considered? If a double matrix takes 3GB, it's 1.5GB in single. That might alleviate the dreaded out-of-memory error for some users in some circumstances. On 64bit, 50GB reduces to 25GB, and that might make the difference between getting something done, or not. If single were appropriate, of course. For GPU too, i/o often dominates, iiuc. For space reasons, is there any possibility of R supporting single precision (and single-bit logical, to reduce memory for logicals by 32 times)? I guess there might be complaints from users using single inappropriately (or worse, not realising we have an unstable result due to single).

You can do any of this using external pointers now. That will remind you that every single function to operate on such objects needs to be rewritten. It's a huge amount of work, benefiting very few people. I don't think anyone in R Core will do it.

Duncan Murdoch

I've been informed off list about the 'bit' package, which seems great and answers my parenthetic complaint (at least). http://cran.r-project.org/web/packages/bit/index.html

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
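For what the 'bit' package gives, a minimal sketch (1 bit per element instead of the 32 bits R uses for each logical):

    library(bit)
    b <- bit(1e6)         # length-1e6 bit vector, all FALSE
    b[c(3, 5)] <- TRUE    # ordinary subassignment works
    sum(b)                # 2
    object.size(b)        # ~125 KB, vs ~4 MB for logical(1e6)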
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Prof Brian Ripley rip...@stats.ox.ac.uk wrote in message news:alpine.lfd.2.02.1107190640280.28...@gannet.stats.ox.ac.uk...

On Mon, 18 Jul 2011, Alireza Mahani wrote:

Simon, Thank you for elaborating on the limitations of R in handling float types. I think I'm pretty much there with you. As for the insufficiency of single-precision math (and hence limitations of GPU), my personal take so far has been that double precision becomes crucial when some sort of error accumulation occurs. For example, in differential equations where boundary values are integrated to arrive at interior values, etc. On the other hand, in my personal line of work (Hierarchical Bayesian models for quantitative marketing), we have so much inherent uncertainty and noise at so many levels in the problem (and no significant error-accumulation sources) that the single vs double precision issue is often inconsequential for us. So I think it really depends on the field as well as the nature of the problem.

The main reason to use only double precision in R was that on modern CPUs double precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more-or-less for free.

But isn't it rather less 'free' when large data sets are considered? If a double matrix takes 3GB, it's 1.5GB in single. That might alleviate the dreaded out-of-memory error for some users in some circumstances. On 64bit, 50GB reduces to 25GB, and that might make the difference between getting something done, or not. If single were appropriate, of course. For GPU too, i/o often dominates, iiuc. For space reasons, is there any possibility of R supporting single precision (and single-bit logical, to reduce memory for logicals by 32 times)? I guess there might be complaints from users using single inappropriately (or worse, not realising we have an unstable result due to single).

Matthew

You also under-estimate the extent to which stability of commonly used algorithms relies on double precision. (There are stable single-precision versions, but they are no longer commonly used. And as Simon said, in some cases stability is ensured by using extra precision where available.)

I disagree slightly with Simon on GPUs: I am told by local experts that the double precision on the latest GPUs (those from the last year or so) is perfectly usable. See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.

Regards, Alireza

--
View this message in context: http://r.789695.n4.nabble.com/Manipulating-single-precision-float-arrays-in-Call-functions-tp3675684p3677232.html
Sent from the R devel mailing list archive at Nabble.com.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK   Fax: +44 1865 272595
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
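The halving in the example above (3GB becoming 1.5GB) is simple byte arithmetic; a quick sketch:

    bytes_double <- 8; bytes_single <- 4
    n <- 3 * 2^30 / bytes_double    # elements in a 3 GB double matrix
    n * bytes_single / 2^30         # 1.5 (GB) for the same data in singles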
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4e259600.5070...@gmail.com...

On 11-07-19 7:48 AM, Matthew Dowle wrote:

[ snip ]

You can do any of this using external pointers now. That will remind you that every single function to operate on such objects needs to be rewritten. It's a huge amount of work, benefiting very few people. I don't think anyone in R Core will do it.

Ok, thanks for the responses.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
Matthew,

I was hoping I misunderstood your first proposal, but I suspect I did not ;). Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to { V1 <- 3; DT[1, V1] }, thus returning the first element of the third column.

Please see FAQ 1.1, since further below it seems to be an expectation issue about 'with' syntax, too.

That said, I don't think it works, either. Taking your example and data.table from r-forge:

[ snip ]

as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce the non-working state I'll need some more environment information please.

Also I suspect there is something quite amiss because even trivial things don't work:

DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
DT[1:4,1:4]
[1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9 and 1.10.

When I first saw your proposal, I thought you had rather something like within(DT, V1[1] <- 3) in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of building 'within' in, too. I'll take a look at within(). Might as well provide as many options as possible to the user to use as they wish.

With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) performs reasonably:

system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the global environment - it also interacts with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the original issue Ivo raised, then? If so, job done.

Cheers,
Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

[ snip ]
Re: [Rd] [datatable-help] speeding up perception
Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge. From NEWS :

o   Fast update is now implemented, FR#200. DT[i,j]<-value is now handled by data.table in C rather than falling through to data.frame methods. Thanks to Ivo Welch for raising speed issues on r-devel, to Simon Urbanek for the suggestion, and Luke Tierney and Simon for information on R internals. [<- syntax still incurs one working copy of the whole table (as of R 2.13.0) due to R's [<- dispatch mechanism copying to `*tmp*`, so, for ultimate speed and brevity, 'within' syntax is now available as follows.

o   A new 'within' argument has been added to [.data.table, by default TRUE. It is very similar to the within() function in base R. If an assignment appears in j, it assigns to the column of DT, by reference; e.g.,

        DT[i, colname <- value]

    This syntax makes no copies of any part of memory at all.

m = matrix(1, nrow=10, ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)

system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed
287.730 323.196 613.453

system.time(for (i in 1:1000) DT[1, V1 <- 3])
   user  system elapsed
  1.152   0.004   1.161     # 528 times faster

Please note :

    *****************************************************
    **  Within syntax is presently highly experimental. **
    *****************************************************

http://datatable.r-forge.r-project.org/

On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:

On Wed, 6 Jul 2011, Simon Urbanek wrote:

Interesting, and I stand corrected:

x = data.frame(a=1:n, b=1:n)
.Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[1,1]=42L
.Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes so at some point ...

The internals assume in various places that deep copies are made (one of the reasons NAMED settings are not propagated to sub-structure). The main issues are avoiding cycles and that there is no easy way to check for sharing. There may be some circumstances in which a shallow copy would be OK but making sure it would be in all cases is probably more trouble than it is worth at this point. (I've tried this in the past in a few cases and always had to back off.)

Best,
luke

Cheers,
Simon

On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

[ snip ]
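To see the working copy of the whole table that the NEWS item above mentions (the `*tmp*` copy made by R's [<- dispatch), a small sketch with tracemem; this assumes an R build with memory profiling enabled (the default on Windows, --enable-memory-profiling elsewhere):

    n <- 100000
    x <- data.frame(a = 1:n, b = 1:n)
    tracemem(x)      # start reporting duplications of x
    x[1, 1] <- 42L   # each "tracemem[... -> ...]" line printed is a copy of x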
Re: [Rd] [datatable-help] speeding up perception
Simon,

If you didn't install.packages() with method=source from R-Forge, that would explain (some of) it. R-Forge builds binaries once each night. This commit was long after the cutoff.

Matthew

[ snip ]
Re: [Rd] Suggestions for R-devel / R-help digest format
Don't most people use a newsreader? For example, pointed to here :

gmane.comp.lang.r.general
gmane.comp.lang.r.devel

IIUC, NNTP downloads headers only; when you open any post it downloads the body at that point. So it's more efficient than email (assuming you don't open every single post). I guess RSS is similar/better. Newsreaders handle threading and you can watch/ignore threads easily.

Actually subscribing via email? The only reason I am subscribed is to post unmoderated (and to encourage Martin with +1 on his subscriber count); I have email delivery turned off in the mailman settings. Thought everyone did that!

If I counted correctly, there are 36 gmane mirrors for various packages and sigs. You can watch all these (including r-devel and r-help) via gmane without needing to subscribe on mailman at all.

Matthew

Saravanan saravanan.thirumuruganat...@gmail.com wrote in message news:4e160850.1040...@gmail.com...

Thanks Steve and Brian! Probably, I will create a gmail account for mailing lists and let it take care of the threading.

Regards,
Saravanan

On 07/07/2011 12:02 PM, Brian G. Peterson wrote:

On Thu, 2011-07-07 at 11:44 -0500, Saravanan wrote:

Hello, I am a passive reader of both the R-devel and R-help mailing lists. I am sending the following comments to r-devel as it seemed more suitable. I am aware that this list uses GNU mailman for list management. I have my options set so that it sends an email digest. One thing I find is that the digest consists of emails ordered temporally. For e.g., let's say there are two threads t1 and t2 and the emails arrive as e1 of t1, e2 of t2, e3 of t1. The digest lists them as e1, e2 and then e3. Is it possible to somehow configure it as T1: e1, e3 and then T2: e2? This is the digest format that google groups uses, which is incredibly helpful as you can read all the messages in a thread. Additionally, it also helpfully includes a header that lists all the threads in the digest so that you can jump to the one you are interested in. I checked the mailman options but could not find any. Does anyone else have the same issue? It is not a big issue on R-devel but R-help is a much higher traffic mailing list. I am interested in hearing how you read/filter your digest mails on either R-help or other high volume mailing lists.

This really has nothing to do with R, but rather mailman. I use folders, filtered on the server using SIEVE and/or procmail. No digest required. I get the mails immediately, not later in the day or the next day, and can use all my various email clients easily to read/respond. mailman supports a MIME digest format that includes a table of contents with links to each MIME part. mailman does not support a threaded digest, to the best of my knowledge.

Regards,
- Brian
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?

n = 1000
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
  0.628   0.000   0.628

n = 10
x = data.frame(a=1:n, b=1:n)   # still 2 columns, but longer columns
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
 20.145   1.232  21.455

With $<- :

n = 1000
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
  0.304   0.000   0.307

n = 10
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
 37.586   0.388  38.161

If it's because the 1st column needs to be copied (only) because that's the one being assigned to (in this test), that magnitude of slow down doesn't seem consistent with the time of a vector copy of the 1st column :

n = 10
v = 1:n
system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed
  0.016   0.000   0.017
system.time(for (i in 1:1000) {v2=v; v2[1] <- 42L})
   user  system elapsed
  1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is assigned to :

n = 10
x = data.frame(rep(list(1:n),100))
dim(x)
[1]  10 100
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
167.974  50.903 219.711

Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:

On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu wrote:

On Tue, 5 Jul 2011, Simon Urbanek wrote:

On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved. Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature, which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on. Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references, so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this though I don't guarantee I did that right).

The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise) and so any R level assignments will duplicate. This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment.

It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

Would $<- always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.

There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best,
luke

Cheers
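A tiny sketch of Luke's point about closures; `first<-` is an invented name, not an existing function:

    `first<-` <- function(x, value) {
        # x arrives via a forced promise, so NAMED(x) is 2 in here:
        # this assignment therefore duplicates x on every call.
        x[1] <- value
        x
    }
    y <- 1:10
    first(y) <- 42L   # works, but copies y each time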
Re: [Rd] [datatable-help] speeding up perception
Simon,

Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :

i) convenience of new users who don't know how to vectorize yet
ii) more complex examples which can't be vectorized.

Before:
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After :
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows :

system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ...

Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well).

I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand, since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.

Cheers,
Simon

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:

thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality.

more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure.

If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an individual read/write into a data frame ("Warning: data frames are much slower than lists of vectors for individual element access").

I would also suggest changing the Introduction to R 6.3 from "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions." to "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions. However, data frames can be much slower than matrices or even lists of vectors (which, like data frames, can contain different types of columns) when individual elements need to be accessed." Reading about it immediately upon introduction could flag the problem in a more visible manner.

regards,
/iaw
__
Re: [Rd] [datatable-help] speeding up perception
Simon (and all), I've tried to make assignment as fast as calling `[-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[-' from copying x? Small reproducible example in vanilla R 2.13.0 : x = list(a=1:1,b=1:1) class(x) = newclass [-.newclass = function(x,i,j,value) x # i.e. do nothing tracemem(x) [1] 0xa1ec758 x[1,2] = 42L tracemem[0xa1ec758 - 0xa1ec558]:# but, x is still copied, why? I've tried returning NULL from [-.newclass but then x gets assigned NULL : [-.newclass = function(x,i,j,value) NULL x[1,2] = 42L tracemem[0xa1ec558 - 0x9c5f318]: x NULL Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far). Matthew On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote: Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for : i) convenience of new users who don't know how to vectorize yet ii) more complex examples which can't be vectorized. Before: system.time(for (r in 1:R) DT[r,20] - 1.0) user system elapsed 12.792 0.488 13.340 After : system.time(for (r in 1:R) DT[r,20] - 1.0) user system elapsed 2.908 0.020 2.935 Where this can be reduced further as follows : system.time(for (r in 1:R) `[-.data.table`(DT,r,2,1.0)) user system elapsed 0.132 0.000 0.131 Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ... Matthew On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote: Timothée, On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote: Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well). I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-though for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative. Cheers, Simon On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote: thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality. 
more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure. If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion
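For background on where that copy comes from: R evaluates a subassignment by rebinding the object to the temporary `*tmp*` and calling the replacement function on it (see the 'Subset assignment' section of the R Language Definition). A sketch of what x[1,2] <- 42L expands to; the extra binding is what bumps NAMED and so triggers the duplicate:

    `*tmp*` <- x
    x <- `[<-`(`*tmp*`, 1, 2, value = 42L)
    rm(`*tmp*`)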
[Rd] help.request() for packages?
Hi, Have I missed something, or misunderstood? The r-help posting guide asks users to contact the package maintainer :

If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. [snip] ONLY [only is bold font] send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports.

but the R-exts guide contains :

The mandatory ‘Maintainer’ field should give a single name with a valid (RFC 2822) email address in angle brackets (for sending bug reports etc.). It should not end in a period or comma. For a public package it should be a person, not a mailing list and not a corporate entity: do ensure that it is valid and will remain valid for the lifetime of the package.

Currently, data.table contains the datatable-help mailing list in the 'Maintainer' field, with the posting guide in mind (and service levels for users). This mailing list is where we would like users to ask questions about the package, not r-help, and not a single person. However, R-exts says that the 'Maintainer' email address should not be a mailing list. There seem to be two requirements:

i) a non-bouncing email address that CRAN maintainers can use - more like the 'Administrator' of the package
ii) a support address for users to send questions and bug reports

The BugReports field in DESCRIPTION is for bugs only, and allows only a URL, not an email address. bug.report() has a 'package' argument and emails the Maintainer field if the BugReports URL is not provided by the package. So, BugReports seems close, but not quite what we'd like. help.request() appears to have no 'package' argument (I checked R 2.13.0). Could a Support field (or better name) be added to DESCRIPTION, and a 'package' argument added to help.request() which uses it? Then the semantics of the Maintainer field can be closer to what the CRAN maintainers seem to think of it; i.e., the package 'Administrator'. Have I misunderstood or missed an option? Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
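For reference, the two existing fields being discussed can be inspected from R itself; a small sketch (the values returned depend on the installed version of the package, and BugReports may be absent):

    packageDescription("data.table", fields = c("Maintainer", "BugReports"))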
[Rd] method=radix in sort.list() isn't actually a radix sort
Dear list, Were you aware that, strictly speaking, do_radixsort in sort.c actually implements a counting sort, not a radix sort ? http://en.wikipedia.org/wiki/Counting_sort If it were a radix sort it wouldn't need the 100,000 range restriction. Clearly the method argument can't be changed (now) from "radix" to "counting", but perhaps a note could be added to the .Rd ? According to Wikipedia, Harold H. Seward created both counting and radix sorting in 1954, and they are distinctly different. I did a grep through all R source for the keyword "radix" in case this was already documented. A google search and rseek.org search didn't return results for "counting sort" in the R context. There appears to be scope to add (true) radix sorting to R then, without the 100,000 range restriction. Is there any interest in that? Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
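To make the distinction concrete: a counting sort allocates one bucket per possible value, so its memory grows with max(x) rather than length(x), which is exactly where a range restriction like 100,000 comes from. A minimal sketch in R, assuming positive integers:

    counting.sort <- function(x) {
        # one tabulate() pass fills the buckets; reconstruction is a rep()
        rep.int(seq_len(max(x)), tabulate(x, nbins = max(x)))
    }
    counting.sort(c(3L, 1L, 2L, 1L))   # 1 1 2 3

A radix sort, by contrast, makes several passes over fixed-width digits (say one byte at a time), so its memory does not depend on the range of the values.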
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
I don't know if that's enough to flip the UTF-8 switches internally in R. If it is enough, then this result may show I'm barking up the wrong tree. Hopefully someone from core is watching who knows. Is it feasible that you run R using an alias, and for some reason the alias is not picking up your shell variables? Best to rule that out now by running sessionInfo() at the R prompt. Otherwise, do you know profiling tools sufficiently to trace the problem at the C level as it runs on Windows? Matthew

Karl Ove Hufthammer k...@huftis.org wrote in message news:ihm9qq$9ej$1...@dough.gmane.org...
Matthew Dowle wrote:
I'm not sure, but note the difference in locale between Linux (UTF-8) and Windows (non UTF-8). As far as I understand it R much prefers UTF-8, which Windows doesn't natively support. Otherwise you could just change your Windows locale to a UTF-8 locale to make R happier.
[...]
If anybody knows a way to trick R on Linux into thinking it has an encoding similar to Windows then I may be able to take a look if I can reproduce the problem in Linux.

Changing the locale to an ISO 8859-1 locale, i.e.:

export LC_ALL=en_US.ISO-8859-1
export LANG=en_US.ISO-8859-1

I could *not* reproduce it; that is, 'table' is as fast on the non-ASCII factor as it is on the ASCII factor.
-- Karl Ove Hufthammer
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Thanks Simon! I can reproduce this on Linux now, too. locale -a didn't show en_US.iso88591 for me so I needed 'sudo locale-gen en_US' first. Then running R with

$ LANG=en_US.ISO-8859-1 R

is enough to reproduce the problem. Karl - can you use tabulate instead, as Simon suggests? Matthew

-- View this message in context: http://r.789695.n4.nabble.com/match-function-causing-bad-performance-when-using-table-function-on-factors-with-multibyte-characters-tp3229526p3237228.html Sent from the R devel mailing list archive at Nabble.com. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
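The tabulate() workaround works because a factor is already a vector of integer level codes, so no character translation is needed to count it; a sketch of a table-like count (variable names are illustrative, not Karl's code):

    f <- factor(sample(c("Æ", "Ø"), 1e5, replace = TRUE))
    counts <- tabulate(f, nbins = nlevels(f))   # counts by level code
    names(counts) <- levels(f)
    counts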
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
I'm not sure, but note the difference in locale between Linux (UTF-8) and Windows (non UTF-8). As far as I understand it R much prefers UTF-8, which Windows doesn't natively support. Otherwise you could just change your Windows locale to a UTF-8 locale to make R happier. My stab in the dark would be that the poor performance on Windows in this case may be down to many calls to translateCharUTF8 internally. There was a change in R 2.12.0 in this area. Running your test in R 2.11.1 on Windows shows the same problem though, so it doesn't look like that change caused this problem. From NEWS 2.12.0 :

o unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in 'unique.c'.

If anybody knows a way to trick R on Linux into thinking it has an encoding similar to Windows then I may be able to take a look if I can reproduce the problem in Linux. Matthew

Karl Ove Hufthammer k...@huftis.org wrote in message news:ihbko3$efs$1...@dough.gmane.org...
[I originally posted this on the R-help mailing list, and it was suggested that R-devel would be a better place to discuss it.]

Running 'table' on a factor with levels containing non-ASCII characters seems to result in extremely bad performance on Windows. Here's a simple example with benchmark results (I've reduced the number of replications to make the function finish within reasonable time):

library(rbenchmark)
x.num=sample(1:2, 10^5, replace=TRUE)
x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
x.fac.nascii=factor(x.num, levels=1:2, labels=c("Æ","Ø"))
benchmark(
  table(x.num),
  table(x.fac.ascii),
  table(x.fac.nascii),
  table(unclass(x.fac.nascii)),
  replications=20
)

                          test replications elapsed   relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
2           table(x.fac.ascii)           20    0.33       1.00      0.33     0.00         NA        NA
3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA

sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 LC_CTYPE=Norwegian-Nynorsk_Norway.1252 LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C LC_TIME=Norwegian-Nynorsk_Norway.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results on the latest prerelease (R 2.13.0 2011-01-18 r54032).
Running the same test (100 replications) on a Linux system with R 2.12.1 Patched results in essentially no difference between the performance on ASCII factors and non-ASCII factors:

                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
2           table(x.fac.ascii)          100   1.488     1.00     1.459    0.028          0         0
3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

sessionInfo()
R version 2.12.1 Patched (2011-01-18 r54033)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=nn_NO.UTF-8 LC_NUMERIC=C LC_TIME=nn_NO.UTF-8
[4] LC_COLLATE=nn_NO.UTF-8 LC_MONETARY=C LC_MESSAGES=nn_NO.UTF-8
[7] LC_PAPER=nn_NO.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_0.3

Profiling the 'table' function indicates almost all the time is spent in the 'match' function, which is used when 'factor' is called on a 'factor' inside 'table'. Indeed, 'x.fac.nascii = factor(x.fac.nascii)' by itself is extremely slow. Is there any theoretical reason 'factor' on 'factor' with non-ASCII characters must be so slow? And why doesn't this happen on Linux? Perhaps a fix for 'table' might be calculating the 'table' statistics *including* all levels (not using the 'factor' function anywhere), and then removing the 'exclude' levels in the end. For example, something along these lines:

res = table.modified.to.not.use.factor(...)
ind = lapply(dimnames(res), function(x) !(x %in% exclude))
do.call("[", c(list(res), ind, drop=FALSE))

(I haven't tested this very much, so there may
Re: [Rd] reliability of R-Forge? (moving to r-Devel)
Spencer and David, My experience of R-Forge :

i) SVN access and project management web pages have been *very* reliable all this year ... up until the weekend. This week was the first time I ever saw "R-Forge Could Not Connect to Database".

ii) The nightly build and checks have been consistently unreliable all year. At best the nightly build is a few days behind the latest commit, but they are working on it. This isn't as critical as (i) though, since users can install from source: install.packages(pkg, type="source", repos="http://R-Forge.R-project.org").

iii) Mailing lists have been down since the weekend and I too have been mailing r-fo...@r-project.org with no response. That is *very* unusual; first time.

Hope that helps to put it into context at least. Matthew

P.S. I notice that R-Forge appears to be back up now, including the mailing lists.

Spencer Graves spencer.gra...@structuremonitoring.com wrote in message news:4c762b50.7000...@structuremonitoring.com...
Hello: Can anyone comment on plans for R-Forge? Please see thread below. Ramsay, Hooker and I would like to release a new version of fda to CRAN. We committed changes for it last Friday. I'd like to see reports of their daily checks, then submit to CRAN from R-Forge. Unfortunately, it seems to be down now, saying "R-Forge Could Not Connect to Database:". I just tried 'install.packages("fda", repos="http://R-Forge.R-project.org")', and got the previous version, which indicates that my changes from last Friday have not been built yet. Also, a few days ago, I got an error from 'install.packages("pfda", repos="http://R-Forge.R-project.org")' (a different package, 'pfda' NOT 'fda'). I don't remember the error message, but this same command worked for me just now. I infer from this that I should consider submitting the latest version of 'fda' to CRAN manually, not waiting for the R-Forge [formerly] daily builds and checks. R-Forge is an incredibly valuable resource. It would be even more valuable if it were more reliable. I very much appreciate the work of the volunteers who maintain it; I am unfortunately not in a position to volunteer to do more for the R-Project generally and R-Forge in particular than I already do. Thanks, Spencer Graves

On 8/26/2010 1:07 AM, Jari Oksanen wrote:
David Kane (dave at kanecap.com) writes:
How reliable is R-Forge? http://r-forge.r-project.org/ It is down now (for me). Reporting "R-Forge Could Not Connect to Database:" I have just started to use it for a project. It has been down for several hours (at least) on different occasions over the last couple of days. Is that common? Will it be more stable soon? Apologies if this is not an appropriate question for R-help.
Dave, This is rather a subject for R-devel. Ignoring this inappropriateness: yes, indeed, R-Forge has been flaky lately. The database was disconnected for the whole weekend, came back on Monday, and is gone again. It seems that mailing lists and email alerts of commits were not working even when the basic R-Forge was up. I have sent two messages to r-fo...@r-project.org on these problems. I haven't got a response, but soon after the first message the Forge woke up, and soon after the second message it went down. Since I'm not Bayesian, I don't know what to say about the effect of my messages. Cheers, Jari Oksanen
__ r-h...@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San José, CA 95126 ph: 408-655-4567 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Non-blocking Eval
There is a video demo of exactly that on the data.table homepage :
http://datatable.r-forge.r-project.org/
http://www.youtube.com/watch?v=rvT8XThGA8o

However, last time I looked, svSocket uses text transfer. It would be really great if it did binary serialization, like Rserve does. Previous threads :
http://r.789695.n4.nabble.com/Using-svSocket-with-data-table-tp924554p924554.html
http://r.789695.n4.nabble.com/Video-demo-of-using-svSocket-with-data-table-tp893671p893672.html
This one contains a comparison of Rserve and svSocket :
http://r.789695.n4.nabble.com/Fwd-Re-Video-demo-of-using-svSocket-with-data-table-tp903723p903723.html
Best, Matthew

Philippe Grosjean phgrosj...@sciviews.org wrote in message news:4c629ab7.60...@sciviews.org...
Hello, For non-blocking access to R through sockets, you should also look at svSocket. It may be more appropriate than Rserve for feeding data to R, while you have another process running in R that does something like updating a graph, or some other calculations. Best, Philippe Grosjean

On 20/07/10 14:10, Martin Kerr wrote:
Sorry I phrased that badly. What I'm trying to do is asynchronously add data to R, i.e. a program will periodically dump some readings to the R server and then later on another program will run some analysis scripts on them. I have managed to add the data via CMD_detachedVoidEval as you suggested. How exactly do I go about attaching to the session again? I know it involves some form of session key that comes back from the detach call, but what form does it take? And how do I use this? Thanks again, Martin

Subject: Re: [Rd] Non-blocking Eval From: simon.urba...@r-project.org Date: Mon, 19 Jul 2010 11:34:29 -0400 CC: r-devel@r-project.org To: mk2...@hotmail.com
On Jul 19, 2010, at 10:58 AM, Martin Kerr wrote:
Hello, I'm currently working with the C++ version of the Rserve Client as part of a student project. Is there an implementation of a non-blocking interface to Rserve in C++? I can find one via the Java JRI but no equivalent in C++.
(Please note that stats-rosuda-devel is the correct list for this.) I'm not quite sure what you mean, because in JRI there is idleEval() which is non-blocking in the sense that it doesn't do anything if R is busy, but that doesn't apply to Rserve as by definition R cannot be busy there. There is no non-blocking interface to JRI - all calls are synchronous. If your question is whether you can start an evaluation in Rserve and not wait for the result then there is CMD_detachedVoidEval in Rserve, but the C++ client only implements a subset of the API which does not include that -- however, it is trivial to implement (just send a request with CMD_detachedVoidEval as there is nothing to decode). Cheers, Simon
[[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
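On the binary transfer point: serialize() already produces a raw vector that can be written straight to a binary socket connection, which is the building block a binary-mode svSocket would need; a minimal sketch (the host, port and listening peer are assumptions for illustration, not svSocket's API):

    con <- socketConnection("localhost", port = 8888, open = "wb", blocking = TRUE)
    bytes <- serialize(mtcars, connection = NULL)   # raw vector encoding of the object
    writeBin(bytes, con)
    close(con)

The receiving side would readBin() the bytes and unserialize() them; see the Warnings section of ?serialize about cross-version use.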
Re: [Rd] suggestion how to use memcpy in duplicate.c
Is this a thumbs up for memcpy for DUPLICATE_ATOMIC_VECTOR at least ? If there is further specific testing then let me know, happy to help, but you seem to have beaten me to it. Matthew

Simon Urbanek simon.urba...@r-project.org wrote in message news:65d21b93-a737-4a94-bdf4-ad7e90518...@r-project.org...
On Apr 21, 2010, at 2:15 PM, Seth Falcon wrote:
On 4/21/10 10:45 AM, Simon Urbanek wrote:
Won't that miss the last incomplete chunk? (and please don't use DATAPTR on INTSXP even though the effect is currently the same) In general it seems that it depends on nt whether this is efficient or not, since calls to short memcpy are expensive (very small nt that is). I ran some empirical tests to compare memcpy vs for() (x86_64, OS X) and the results were encouraging - depending on the size of the copied block the difference could be quite big:
- tiny block (ca. n = 32 or less): for() is faster
- small block (n ~ 1k): memcpy is ca. 8x faster
- as the size increases the gap closes (presumably due to RAM bandwidth limitations), so for n = 512M it is ~30%
Of course this is contingent on the implementation of memcpy, compiler, architecture etc. And will only matter if copying is what you do most of the time ...
Copying of vectors is something that I would expect to happen fairly often in many applications of R. Is for() faster on small blocks by enough that one would want to branch based on size?
Good question. Given that the branching itself adds overhead, possibly not. In the best case for() can be ~40% faster (for single-digit n) but that means billions of copies to make a difference (since the operation itself is so fast). The break-even point on my test machine is n=32 and when I added the branching it took a 20% hit, so I guess it's simply not worth it. The only case that may be worth branching is n:1 since that is likely a fairly common use (the branching penalty in copy routines is lower than comparing memcpy/for implementations since the branching can be done before the outer for loop, so this may vary case-by-case). Cheers, Simon
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion how to use memcpy in duplicate.c
Just to add some clarification, the suggestion wasn't motivated by speeding up a length 3 vector being recycled 3.3 million times. But it's a good point that any change should not make that case slower. I don't know how much vectorCopy is called really; DUPLICATE_ATOMIC_VECTOR seems more significant, which doesn't recycle, and already had the FIXME next to it. Where copyVector is passed a large source though, then memcpy should be faster than any of the methods using a for loop through each element (whether recycling or not), allowing for the usual caveats. What are the timings like if you repeat the for loop 100 times to get a more robust timing ? It needs to be a repeat around the for loop only, not the allocVector whose variance looks to be included in those timings below. Then increase the size of the source vector, and compare to memcpy. Matthew

William Dunlap wdun...@tibco.com wrote in message news:77eb52c6dd32ba4d87471dcd70c8d70002ce6...@na-pa-vbe03.na.tibco.com...
If I were worried about the time this loop takes, I would avoid using i%nt. For the attached C code compiled with gcc 4.3.3 with -O2 I get

# INTEGER() in loop
> system.time( r1 <- .Call("my_rep1", 1:3, 1e7) )
   user  system elapsed
  0.060   0.012   0.071
# INTEGER() before loop
> system.time( r2 <- .Call("my_rep2", 1:3, 1e7) )
   user  system elapsed
  0.076   0.008   0.086
# replace i%src_length in loop with j=0 before loop and
# if(++j==src_length) j=0 ; in the loop.
> system.time( r3 <- .Call("my_rep3", 1:3, 1e7) )
   user  system elapsed
  0.024   0.028   0.050
> identical(r1,r2)
[1] TRUE
> identical(r2,r3)
[1] TRUE

The C code is:

#define USE_RINTERNALS /* pretend we are in the R kernel */
#include <R.h>
#include <Rinternals.h>

SEXP my_rep1(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int i,j ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    for(i=0;i<dest_length;i++) {
        INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep2(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    /* end of boilerplate */
    for(i=0;i<dest_length;i++) {
        pdest[i] = psrc[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep3(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i,j ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    /* end of boilerplate */
    for(j=0,i=0;i<dest_length;i++) {
        *pdest++ = psrc[j++] ;
        if (j==src_length) { j = 0 ; }
    }
    UNPROTECT(1) ;
    return s_dest ;
}

Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com

-Original Message- From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Romain Francois Sent: Wednesday, April 21, 2010 12:32 PM To: Matthew Dowle Cc: r-de...@stat.math.ethz.ch Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c

Le 21/04/10 17:54, Matthew Dowle a écrit :
From copyVector in duplicate.c :

void copyVector(SEXP s, SEXP t)
{
    int i, ns, nt;
    nt = LENGTH(t);
    ns = LENGTH(s);
    switch (TYPEOF(s)) {
    ...
    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;
    ...

could that be replaced with :

    case INTSXP:
        for (i=0; i<ns/nt; i++)
            memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

or at least with something like this:

    int* p_s = INTEGER(s) ;
    int* p_t = INTEGER(t) ;
    for( i=0 ; i<ns ; i++){
        p_s[i] = p_t[i % nt];
    }

since expanding the INTEGER macro over and over has a price. and similar for the other types in copyVector. This won't help regular vector copies, since those seem to be done by the DUPLICATE_ATOMIC_VECTOR macro, see next suggestion below, but it should help copyMatrix which calls copyVector, scan.c which calls copyVector on three lines, dcf.c (once) and dounzip.c (once). For the DUPLICATE_ATOMIC_VECTOR macro there is already a comment next to it : FIXME: surely memcpy would be faster here? which seems to refer to the for loop :

    else { \
        int __i__; \
        type *__fp__ = fun(from), *__tp__ = fun(to); \
        for (__i__
[Rd] suggestion how to use memcpy in duplicate.c
From copyVector in duplicate.c :

void copyVector(SEXP s, SEXP t)
{
    int i, ns, nt;
    nt = LENGTH(t);
    ns = LENGTH(s);
    switch (TYPEOF(s)) {
    ...
    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;
    ...

could that be replaced with :

    case INTSXP:
        for (i=0; i<ns/nt; i++)
            memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

and similar for the other types in copyVector. This won't help regular vector copies, since those seem to be done by the DUPLICATE_ATOMIC_VECTOR macro, see next suggestion below, but it should help copyMatrix which calls copyVector, scan.c which calls copyVector on three lines, dcf.c (once) and dounzip.c (once).

For the DUPLICATE_ATOMIC_VECTOR macro there is already a comment next to it : FIXME: surely memcpy would be faster here? which seems to refer to the for loop :

    else { \
        int __i__; \
        type *__fp__ = fun(from), *__tp__ = fun(to); \
        for (__i__ = 0; __i__ < __n__; __i__++) \
            __tp__[__i__] = __fp__[__i__]; \
    } \

Could that loop be replaced by the following ?

    else { \
        memcpy((char *)DATAPTR(to), (char *)DATAPTR(from), __n__*sizeof(type)); \
    } \

In the data.table package, dogroups.c uses this technique, so the principle is tested and works well so far. Are there any road blocks preventing this change, or is anyone already working on it ? If not then I'll try and test it (on Ubuntu 32bit) and submit a patch with timings, as before. Comments/pointers much appreciated. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Suggestion to add crantastic to resources section on posting guide
Under the further resources section I'd like to suggest the following addition :

* http://crantastic.org/ lists popular packages according to other users' votes. Consider briefly reviewing the top 30 packages before posting to r-help since someone may have already released a package that solves your problem.

That's just a straw-man idea, so I hope there will be an answer, or discussion, either way. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] shash in unique.c
I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No response resulted in a retry() after a Sys.sleep(10 days). If it's a yes or maybe then I could proceed to try it, test it, and present the test results and timings to you along with the patch. It would be on 32bit Ubuntu first, and I would need to either buy, rent time on, or borrow a 64bit machine to be able to then test there, owing to the nature of the suggestion. If it's a no, 'bad idea because ...', 'we were already working on it', or better, then I won't spend any more time on it. Matthew

Matthew Dowle mdo...@mdowle.plus.com wrote in message news:hlu4qh$l7...@dough.gmane.org...
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer? shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] shash in unique.c
Thanks a lot. Quick and brief responses below...

Duncan Murdoch murd...@stats.uwo.ca wrote in message news:4b90f134.6070...@stats.uwo.ca...
Matthew Dowle wrote:
I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No response resulted in a retry() after a Sys.sleep(10 days). If it's a yes or maybe then I could proceed to try it, test it, and present the test results and timings to you along with the patch. It would be on 32bit Ubuntu first, and I would need to either buy, rent time on, or borrow a 64bit machine to be able to then test there, owing to the nature of the suggestion. If it's a no, 'bad idea because ...', 'we were already working on it', or better, then I won't spend any more time on it. Matthew

Matthew Dowle mdo...@mdowle.plus.com wrote in message news:hlu4qh$l7...@dough.gmane.org...
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer?

Two negative but probably not fatal issues: Pointers and ints are not always the same size. In Win64, ints are 32 bits, pointers are 64 bits. (Can we be sure there is some integer type the same size as a pointer? I don't know, ask a C expert.)

No we can't be sure. But we could test at runtime, and if the assumption wasn't true, then revert to the existing method.

We might want to save the hash to disk. On restore, the pointer based hash would be all wrong. (I don't know if we actually do ever save a hash to disk.)

The hash table in unique.c appears to be a temporary private hash, different to the global R_StringHash. Its private hash appears to be used only while the call to unique runs, then free'd. That's my understanding anyway. The suggestion is not to alter the global R_StringHash in any way at all, which is the one that might be saved to disk now or in the future.

Duncan Murdoch

shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
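On the pointer-size point, the relevant sizes are already visible from R itself, so the runtime check mentioned above is straightforward; a small sketch (values are per-build):

    .Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on 32-bit
    .Machine$sizeof.long      # size of a C long, for comparison

If sizeof.pointer is larger than an unsigned int, the proposed cast would discard the high bits of the address, which is where a conditional fallback to the existing string hash would apply.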
Re: [Rd] Suggestion to add crantastic to resources section on posting guide
That appears to be an epistemic error. Some people, and I would agree it seems like an increasing number of people, clearly don't read the posting guide. However, it is impossible for anyone to know how many people do read it, do thoroughly read it and, therefore, don't ever need to post to r-help. Those people would be missing from the statistical sample of people who do post. In fact it would be very surprising indeed, assuming it is true that R is getting more popular, not to see the number of non-compliant posters increase. I don't believe in basing decisions upon poorly applied statistics, especially ones that go from correlation to causation so casually.

Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1003050433i7f104bd4l1e1421fab0d3...@mail.gmail.com...
I don't think we should be expanding the posting guide. It's already so long that no one reads it. We should be thinking of ways to cut it down to a smaller size instead.
On Fri, Mar 5, 2010 at 5:52 AM, Matthew Dowle mdo...@mdowle.plus.com wrote:
Under the further resources section I'd like to suggest the following addition :
* http://crantastic.org/ lists popular packages according to other users' votes. Consider briefly reviewing the top 30 packages before posting to r-help since someone may have already released a package that solves your problem.
That's just a straw-man idea, so I hope there will be an answer, or discussion, either way. Matthew
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] shash in unique.c
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer? shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Why is there no c.factor?
> concat() doesn't get a lot of use

How do you know? Maybe it's used a lot but the users had no need to tell you what they were using. The exact opposite might in fact be the case, i.e. because concat is so good in splus, you just never hear of problems with it from the users. That might be a very good sign.

> perhaps that model would work well for a concatenation function in R

I'd be happy to test it. I'm a bit concerned about performance though, given what you said about repeated recursive calls, and dispatch. Could you run the following test in s-plus please and post back the timing? If this small 100MB example was fine, then we could proceed to a 64bit 10GB test. This is quite nippy at the moment in R (1.1 sec). I'd be happy with a better way as long as speed wasn't compromised.

set.seed(1)
L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))   # union set of 676 levels
F = lapply(1:100, function(i) {
    # create 100 factors
    f = sample(1:100, 1*1024^2 / 4, replace=TRUE)   # each factor 1MB large (262144 integers), plus a small amount for the levels
    levels(f) = sample(L,100)   # pick 100 levels from the union set
    class(f) = "factor"
    f
})

> head(F[[1]])
[1] RT DM CO JV BG KU
100 Levels: YC FO PN IL CB CY HQ ...
> head(F[[2]])
[1] RK PD FE SG SJ CQ
100 Levels: JV FV DX NL XB ND CY QQ ...

With c.factor from data.table, as posted, placed in .GlobalEnv :

> system.time(G <- do.call(c,F))
   user  system elapsed
   0.81    0.32    1.12
> head(G)
[1] RT DM CO JV BG KU   # looks right, comparing to F[[1]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> G[262145:262150]
[1] RK PD FE SG SJ CQ   # looks right, comparing to F[[2]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> identical(as.character(G),as.character(unlist(F)))
[1] TRUE

So I guess this would be compared to the following in splus ?

system.time(G <- do.call(concat, F))

or maybe it's just the following :

system.time(G <- concat(F))

I don't have splus so I can't test that myself.

William Dunlap wdun...@tibco.com wrote in message news:77eb52c6dd32ba4d87471dcd70c8d7000275b...@na-pa-vbe03.na.tibco.com...
-Original Message- From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Peter Dalgaard Sent: Friday, February 05, 2010 7:41 AM To: Hadley Wickham Cc: John Fox; r-devel@r-project.org; Thomas Lumley Subject: Re: [Rd] Why is there no c.factor?

Hadley Wickham wrote:
On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham had...@rice.edu wrote:
I'd propose the following: If the sets of levels of all arguments are the same, then c.factor() would return a factor with the common set of levels; if the sets of levels differ, then, as Hadley suggests, the level-set of the result would be the union of the sets of levels of the arguments, but a warning would be issued.
I like this compromise (as long as there was an argument to suppress the warning). If I provided code to do this, along with the warnings for ordered factors and using the optimisation suggested by Matthew, is there any member of R core who would be interested in sponsoring it? Hadley

Messing with c() is a bit unattractive (I'm not too happy with the other c methods either; normally c() strips attributes and reduces to the base class, and those obviously do not), but a more general concat() function has been suggested a number of times. With a suitable range of methods, this could also be used to reimplement rbind.data.frame (which, incidentally, already contains a method for concatenating factors, with several ugly warts!)

Yes, c() should have been put on the deprecated list a couple of decades ago, since people expect it to do too many incompatible things. And factor should have been a virtual class, with subclasses FixedLevels (e.g., Sex) or AdHocLevels (e.g., FamilyName), so c() and [<-() could do the appropriate thing in either case. Back to reality, S+ has a concat(...) function, whose comments say

# This function works like c() except that names of arguments are
# ignored. That is, it concatenates its arguments into a single
# S vector object, without considering the names of the arguments,
# in the order that the arguments are given.
#
# To make this function work for new classes, it is only necessary
# to make methods for the concat.two function, which concatenates
# two vectors; recursion will take care of the rest.

concat() is not generic but it repeatedly calls concat.two(x,y), an SV4-generic that dispatches on the classes of x and y. Thus you can easily predict the class of concat(x,y,z), although it may not be the same as the class of concat(z,y,x), given suitably bizarre methods for concat.two(). concat() doesn't get a lot of use but I think the idea is sound. Perhaps
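Bill's pairwise design is easy to mimic in R for anyone wanting to experiment; a hypothetical sketch (the names concat/concat.two are taken from his description; plain S3 dispatches on the first argument only, unlike the SV4 double dispatch he describes):

    concat <- function(...) Reduce(concat.two, list(...))
    concat.two <- function(x, y) UseMethod("concat.two")
    concat.two.default <- function(x, y) c(x, y)
    concat.two.factor <- function(x, y) {
        # union the levels, then map both sets of codes onto them
        lev <- union(levels(x), levels(y))
        factor(c(as.character(x), as.character(y)), levels = lev)
    }
    concat(factor("a"), factor("b"), factor(c("c","b","a")))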
Re: [Rd] Why is there no c.factor?
A search for c.factor returns tons of hits on this topic. Here's just one of the hits from 2006, when I asked the same question : http://tolstoy.newcastle.edu.au/R/e2/devel/06/11/1137.html So it appears to be complicated and there are good reasons. Since I needed it, I created c.factor in the data.table package, below. It does it more efficiently since it doesn't convert each factor to character (hence losing some of the benefit). I've been told I'm not unique in this approach and that other packages also have their own c.factor. It deliberately isn't exported. It's worked well for me over the years anyway.

c.factor = function(...) {
    args <- list(...)
    for (i in seq(along=args))
        if (!is.factor(args[[i]])) args[[i]] = as.factor(args[[i]])
    # The first must be factor otherwise we wouldn't be inside c.factor; it's checked anyway in the line above.
    newlevels = sort(unique(unlist(lapply(args, levels))))
    ans = unlist(lapply(args, function(x) {
        m = match(levels(x), newlevels)
        m[as.integer(x)]
    }))
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}

Hadley Wickham had...@rice.edu wrote in message news:f8e6ff051002040753x33282f33l78fce9f98dc29...@mail.gmail.com...
Hi all, Is there a reason that there is no c.factor method? Analogous to c.Date, I'd expect something like the following to be useful:

c.factor <- function(...) {
    factors <- list(...)
    levels <- unique(unlist(lapply(factors, levels)))
    char <- unlist(lapply(factors, as.character))
    factor(char, levels = levels)
}

c(factor("a"), factor("b"), factor(c("c", "b", "a")), factor("d"))
# [1] a b c b a d
# Levels: a b c d

Hadley -- http://had.co.nz/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] wiki down?
I see the same problem. The wiki link on the R homepage doesn't seem to respond. A search of r-devel for subjects containing wiki finds this seemingly unanswered recent post. Is it known? -Matthew

Ben Bolker bol...@ufl.edu wrote in message news:4b44b12a.60...@ufl.edu...
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] split.data.frame
This seems very similar to the data.table package. The 'by' argument splits the data.table by that value, then executes the j expression within each subset. The package documentation talks about 'subset' and 'with' in some detail. See ?[.data.table.

dt = data.table(x=1:20, y=rep(1:4,each=5))
dt[,sum(x),by=y]

> and x has a variable called grp, what do you get?

In data.table that choice is given to the user via the argument 'with', which by default is TRUE, meaning you get the x inside dt.

Romain Francois romain.franc...@dbmail.com wrote in message news:4b288645.3010...@dbmail.com...
On 12/16/2009 12:14 AM, Peter Dalgaard wrote:
Romain Francois wrote:
Hello, I very much enjoy with and subset semantics for data frames and was wondering if we could have something similar with split, basically by evaluating the second argument with the data frame :
I seem to recall that this idea was considered and rejected when the current split.data.frame was written (10 years ago!). The main reasons were that
- it's not really THAT hard to evaluate a single splitting expression using with() or eval()
Sure, this is just about convenience and laziness.
- not all applications will have the splitting factor inside the df to split ( split(df[-1], df[[1]]) for a simple case)
this still works
- if you need a computed splitting factor, there's a risk of inadvertent variable capture. I.e., if you inside a function do
grp <- ...whatever...
spl <- split(x, grp)
and x has a variable called grp, what do you get?
this is a problem indeed. thanks for the reply. Romain
-- Romain Francois Professional R Enthusiast +33(0) 6 28 91 30 30 http://romainfrancois.blog.free.fr |- http://tr.im/HlX9 : new package : bibtex |- http://tr.im/Gq7i : ohloh `- http://tr.im/FtUu : new package : highlight
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
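For comparison with the dt example above, the base-R spelling of that grouped sum goes through split(), the function under discussion; a minimal sketch:

    sapply(split(dt$x, dt$y), sum)   # one sum per group in y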
Re: [Rd] Using svSocket with data.table
Hi Olaf, Thanks for your feedback, much appreciated.

> Don't be fooled. R does not handle multiple requests in parallel internally.

I wasn't fooled, but I've added some annotations to the video at the place I might have given the impression I was (at 4min 39sec). Later, at 5min 30sec, I did already point out that the graph stopped while the R server processed this client's request, but that is later. http://www.youtube.com/watch?v=rvT8XThGA8o

> Also I suspect that, depending on what you do on the CLI, this will interact badly with svSocket.

Can you give an example to try out? Regards, Matthew

Olaf Mersmann ol...@kimberly.tako.de wrote in message news:1248555172-sup-4...@bloxx.local...
Hi Matthew,
Excerpts from Matthew Dowle's message of Sat Jul 25 09:07:44 +0200 2009:
So I'm looking to do the same as the demo, but with a binary socket. Does anyone have any ideas? I've looked a bit at Rserve, bigmemory, biocep, nws but although all those packages are great, I didn't find anything that worked in exactly this way, i.e. i) R to R, ii) CLI non-blocking and iii) no need to start up R in a special way.
Don't be fooled. R does not handle multiple requests in parallel internally. Also I suspect that, depending on what you do on the CLI, this will interact badly with svSocket. As far as binary transfer of R objects goes, you are probably looking for serialize() and unserialize(). Not sure if these are guaranteed to work across different versions of R and different word sizes. See the Warnings section in the serialize manual page. Cheers Olaf
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] merge performace degradation in 2.9.1
Is there a way to avoid the degradation in performance in 2.9.1? If the example is to demonstrate a difference between R versions that you really need to get to the bottom of, then read no further. However, if the example is actually what you want to do, then you can speed it up by using a data.table as follows, reducing the 26 secs to 1 sec. Time on my PC at home (quite old now!) :

> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   user  system elapsed
  25.63    0.58   26.98

Using a data.table instead :

X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N), key="mon")
Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
> tables()
     NAME      NROW COLS       KEY
[1,] X    1,200,000 group,mon  mon
[2,] Y           12 mon,letter mon
> system.time(X$letter <- Y[X,letter])   # Y[X] is the syntax for merge of two data.tables
   user  system elapsed
   0.98    0.11    1.10
> identical(Out$letter, X$letter)
[1] TRUE
> identical(Out$mon, X$mon)
[1] TRUE
> identical(Out$group, X$group)
[1] TRUE

To do the multi-column equi-join of X and Z, set a key of 2 columns. 'nomatch' is the equivalent of 'all' and can be set to 0 (inner join) or NA (outer join).

Adrian Dragulescu adria...@eskimo.com wrote in message news:pine.lnx.4.64.0907090953580.1...@shell.eskimo.com...
I have noticed a significant performance degradation using merge in 2.9.1 relative to 2.8.1. Here is what I observed:

N <- 10
X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Z <- cbind(Y, group=1:12)
system.time(Out <- merge(X, Y, by="mon", all=TRUE))
# R 2.8.1 is 17% faster than R 2.9.1 for N=10
system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
# R 2.8.1 is 16% faster than R 2.9.1 for N=10

Here is the head of summaryRprof() for 2.8.1

$by.self
                   self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2

and for 2.9.1

$by.self
                   self.time self.pct total.time total.pct
sort.list               4.66     39.2       4.66      39.2
nchar                   3.28     27.6       3.28      27.6
make.unique             1.42     12.0       1.92      16.2
as.character            0.50      4.2       0.50       4.2
data.frame              0.46      3.9       4.12      34.7
[.data.frame            0.44      3.7       7.28      61.3

As you notice, 2.9.1 has an nchar entry that is quite time consuming. Is there a way to avoid the degradation in performance in 2.9.1? Thank you, Adrian

As an aside, I got interested in testing merge in 2.9.1 by reading the r-devel message from 30-May-2009 "Degraded performance with rank()" by Tim Bergsma, as he mentions doing merges, but only today decided to test.
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
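A sketch of the two-column keyed join mentioned above, written in current data.table syntax (the release current at the time of this thread may have wanted the key as a single comma-separated string, so treat this as illustrative; X2/Z2 are hypothetical copies to avoid clobbering the objects above):

    X2 <- data.table(X, key = c("mon", "group"))
    Z2 <- data.table(Z, key = c("mon", "group"))
    Z2[X2, nomatch = NA]   # keeps all rows of X2; nomatch = 0 would drop non-matches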
Re: [Rd] Can we generate exe file using R? What is the maximum file size valid?
Does Ra get close to compiled R ? The R code is compiled on the fly to bytecode which is executed internally by an interpreter in C. The timing tests look impressive. http://www.milbo.users.sonic.net/ra/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
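For context, stock R later gained a related facility when the byte-code compiler shipped as the 'compiler' base package in R 2.13.0; a minimal sketch of compiling a function (timings are machine-dependent):

    library(compiler)
    f <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
    fc <- cmpfun(f)   # byte-compiled copy of f
    system.time(f(1e6))
    system.time(fc(1e6))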