[Rd] SET_NAMED in getattrib0
Can someone please set me straight on why getattrib0 calls SET_NAMED on the SEXP it returns? For example, the line SET_NAMED(CAR(s), 2); appears near the end of getattrib0 here: https://svn.r-project.org/R/trunk/src/main/attrib.c

getattrib() is just reading the value. Shouldn't NAMED be bumped if and when the result of getattrib() is bound to a symbol at R level?

Thanks, Matthew
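For anyone wanting to watch this from the R level: the NAMED value is visible via .Internal(inspect()). A minimal sketch -- assuming attr() reaches getattrib0 here, and noting that recent R versions print a REF() reference count rather than NAM():

    x <- 1:5
    attr(x, "myattr") <- "a"     # set an attribute
    a <- attr(x, "myattr")       # read it back
    .Internal(inspect(a))        # header shows NAM(2), per the SET_NAMED above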
Re: [Rd] declaring package dependencies
On Sep 16, 2013, at 01:46 PM, Brian Rowe wrote:

That reminds me: I once made a suggestion on how to automate some of the CRAN deployment process, but it was shot down as not being useful to them. I do recall a quote along the lines of "as long as you don't need help, do whatever you want", so one thought is to just set up a build server that does the building across the three versions of R, checks dependencies, rebuilds when release, patch, or devel are updated, etc. This would ease the burden on package maintainers and would just happen to make the CRAN folks' lives easier by catching a lot of bad builds. A proof of concept on AWS connecting to GitHub or R-Forge could probably be finished on a six-pack. Speak up if anyone thinks this would be useful.

Yes, useful. But that includes a package build system (which is what breaks on R-Forge). If you could do that on a six-pack then could you fix R-Forge on a three-pack first, please? The R-Forge build system is itself an open source package on R-Forge. Anyone can look at it, understand it and change it to be more stable. That build system is here: https://r-forge.r-project.org/R/?group_id=34 (I only know this because Stefan told me once. So I suspect others don't know either, or it hasn't sunk in that we're pushing on an open door.)

Matthew
Re: [Rd] declaring package dependencies
Ben Bolker wrote: Do you happen to remember what the technical difficulty was?

From memory, I think it was that CRAN maintainers didn't have access to Uwe's winbuilder machine. But often when I get OK from winbuilder R-devel I don't want it to go to CRAN yet. So procedures and software would have to be put in place to handle that (unclear) logic, which I didn't propose anything for or offer any code to do. So: time and effort to decide, and time and effort to implement. Just a guess. And maybe some packages don't run on Windows, so what about those? It's all those edge cases that really take the time.

Matthew
Re: [Rd] helping R-forge build
On 16/09/13 16:11, Paul Gilbert wrote: (subject changed from Re: [Rd] declaring package dependencies)

... Yes useful. But that includes a package build system (which is what breaks on R-Forge). If you could do that on a six-pack then could you fix R-Forge on a three-pack first, please? The R-Forge build system is itself an open source package on R-Forge. Anyone can look at it, understand it and change it to be more stable. That build system is here: https://r-forge.r-project.org/R/?group_id=34 (I only know this because Stefan told me once. So I suspect others don't know either, or it hasn't sunk in that we're pushing on an open door.) Matthew

Open code is necessary, but to debug one needs access to logs, etc., to see where it is breaking. Do you know how to find that information?

There's a link at the bottom of the R-Forge page to http://download.r-forge.r-project.org/STATUS -- I don't know if that's enough but it's a start, maybe. I've copied Stefan in case there are more logs somewhere else.

(And, BTW, there are also tools to help automatically build R and test packages at http://automater.r-forge.r-project.org/ .)

automater looks good! What's the next step?

Paul
Re: [Rd] declaring package dependencies
I'm a little surprised by this thread. I subscribe to the RSS feeds of changes to NEWS (as Dirk mentioned) and that's been pretty informative in the past: http://developer.r-project.org/RSSfeeds.html

Mainly, though, I submit to winbuilder before submitting to CRAN, as the CRAN policies advise. winbuilder's R-devel seems to be built daily, saving me the time. Since I don't have Windows it kills two birds with one stone. It has caught many problems for me before submitting to CRAN and I can't remember it ever not responding in a reasonable time. http://win-builder.r-project.org/upload.aspx

I've suggested before that winbuilder could be the mechanism to submit to CRAN rather than an ftp upload to incoming: only if winbuilder passed OK on R-devel could it then go to a human. But IIRC there was a technical difficulty preventing this.

Matthew
[Rd] C API entry point to currentTime()
Hi,

I used to use currentTime() (from /src/main/datetime.c) to time various sections of data.table C code in wall clock time with sub-second accuracy (type double), consistently across platforms. The consistency across platforms is a really nice feature of currentTime(). But currentTime() isn't part of R's API so I changed to clock() in order to pass R3 checks. This is nicer in many ways but I'd still like to time elapsed wall clock time as well, since some of the operations are i/o bound.

Does R provide a C entry point to currentTime() (or equivalent) suitable for use by packages? I searched the r-devel archive and the manuals but may well have missed it.

Thanks, Matthew
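For comparison, at the R level (not C) the wall-clock versus CPU distinction the post is after is the one between the components of proc.time(); a small illustration:

    p <- proc.time()
    Sys.sleep(1)         # an i/o-like wait: elapses without consuming CPU
    proc.time() - p      # user/system stay ~0; 'elapsed' (wall clock) is ~1 sec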
Re: [Rd] double in summary.c : isum
On 25.03.2013 09:20, Prof Brian Ripley wrote: On 24/03/2013 15:01, Duncan Murdoch wrote: On 13-03-23 10:20 AM, Matthew Dowle wrote: On 23.03.2013 12:01, Prof Brian Ripley wrote: On 20/03/2013 12:56, Matthew Dowle wrote:

Hi, Please consider the following:

    x = as.integer(2^30-1)
    x
    [1] 1073741823
    sum(c(rep(x, 1000), rep(-x, 999)))
    [1] 1073741824

Tested on 2.15.2 and a recent R-devel (r62132). I'm wondering if s in isum could be LDOUBLE instead of double, like rsum, to fix this edge case?

No, because there is no guarantee that LDOUBLE differs from double (and there are platforms on which it does not).

That's a reason for not using LDOUBLE at all, isn't it? Yet src/main/*.c has 19 lines using LDOUBLE, e.g. arithmetic.c and cum.c as well as summary.c. I'd assumed LDOUBLE was being used by R to benefit from long double (or equivalent) on platforms that support it (which is all modern Unix, Mac and Windows as far as I know). I do realise that the edge case wouldn't ...

Actually, you don't know. Really only on almost all Intel ix86: most other current CPUs do not have it in hardware. C99/C11 require long double, but do not require the accuracy that you are thinking of, and it can be implemented in software.

This is very interesting, thanks. Which of the CRAN machines don't support LDOUBLE with higher accuracy than double, either in hardware or software? Yes, I had assumed that all CRAN machines would do. It would be useful to know for something else I'm working on as well.

... be fixed on platforms where LDOUBLE is defined as double.

I think the problem is that there are two opposing targets in R: we want things to be as accurate as possible, and we want them to be consistent across platforms. Sometimes one goal wins, sometimes the other. Inconsistencies across platforms give false positives in tests that tend to make us miss true bugs. Some people think we should never use LDOUBLE because of that. In other cases, the extra accuracy is so helpful that it's worth it. So I think you'd need to argue that the case you found is something where the benefit outweighs the costs. Since almost all integer sums are done exactly with the current code, is it really worth introducing inconsistencies in the rare inexact cases?

But as I said lower down, a 64-bit integer accumulator would be helpful; C99/C11 requires one at least that large and it is implemented in hardware on all known R platforms. So there is a way to do this pretty consistently across platforms.

That sounds much better. Is it just a matter of changing s to be declared as uint64_t?

Duncan Murdoch

What have I misunderstood?

Users really need to take responsibility for the numerical stability of calculations they attempt. Expecting to sum 20 million large numbers exactly is unrealistic.

Trying to take responsibility, but you said no. Changing from double to LDOUBLE would mean that something that wasn't realistic, was then realistic (on platforms that support long double). And it would bring open source R into line with TERR, which gets the answer right, on 64-bit Windows at least. But I'm not sure I should be as confident in TERR as I am in open source R because I can't see its source code.

There are cases where 64-bit integer accumulators would be beneficial, and this is one. Unfortunately C11 does not require them but some optional moves in that direction are planned.
https://svn.r-project.org/R/trunk/src/main/summary.c

Thanks, Matthew
Re: [Rd] double in summary.c : isum
On 25.03.2013 11:27, Matthew Dowle wrote: [...] That sounds much better. Is it just a matter of changing s to be declared as uint64_t?

Typo. I meant int64_t.
Re: [Rd] double in summary.c : isum
On 25.03.2013 11:31, Matthew Dowle wrote: [...] Typo. I meant int64_t.

But even a 64-bit integer might under- or overflow. Which is one of the reasons for accumulating in double (or LDOUBLE), isn't it? To save a test for over/underflow on each iteration.
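For background on the accumulator question, the quantitative limit of a double accumulator is its 53-bit mantissa, which can be seen directly in R:

    2^53 == 2^53 + 1       # TRUE : the +1 is lost beyond 53 bits
    2^53 - 2 == 2^53 - 1   # FALSE: integers below 2^53 are still exact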
Re: [Rd] conflict between rJava and data.table
Simon Urbanek wrote: Can you elaborate on the details as of where this will be a problem? Packages should not be affected since they should be importing the namespaces from the packages they use, so the only problem would be in a package that uses both data.table and rJava -- and this is easily resolved in the namespace of such package. So there is no technical reason why you can't have multiple definitions of J - that's what namespaces are for.

Right. It's users using J() in their own code, IIUC. rJava's manual says J is "the high-level access to Java". When they use J() on its own they probably want the rJava one, but if data.table is higher they get that one. They don't want to have to write out rJava::J(...). It is not just rJava but package XLConnect, too. If there's a better way I would be interested, but I didn't mind removing J from data.table.

Bunny/Matt,

To add to Steve's reply, here's some background. This is well documented in NEWS, and Googling "data.table J rJava" and similar returns useful links to NEWS and datatable-help (so you shouldn't have needed to post to r-devel).

From 1.8.2 (Jul 2012):

o The J() alias is now deprecated outside DT[...], but will still work inside DT[...], as in DT[J(...)]. J() is conflicting with function J() in package XLConnect (#1747) and rJava (#2045). For data.table to change is easier, with some efficiency advantages too. The next version of data.table will issue a warning from J() when used outside DT[...]. The version after will remove it. Only then will the conflict with rJava and XLConnect be resolved. Please use data.table() directly instead of J(), outside DT[...].

From 1.8.4 (Nov 2012):

o J() now issues a warning (when used *outside* DT[...]) that using it outside DT[...] is deprecated. See item below in v1.8.2. Use data.table() directly instead of J(), outside DT[...]. Or, define an alias yourself. J() will continue to work *inside* DT[...] as documented.

From 1.8.7 (soon to be on CRAN):

o The J() alias is now removed *outside* DT[...], but will still work inside DT[...]; i.e., DT[J(...)] is fine. As warned in v1.8.2 (see below in this file) and deprecated with warning() in v1.8.6. This resolves the conflict with function J() in package XLConnect (#1747) and rJava (#2045). Please use data.table() directly instead of J(), outside DT[...].

Matthew
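The NEWS item's suggestion to "define an alias yourself" can be as small as this, in a user's own script -- assuming rJava's J() is the one wanted:

    library(rJava)
    library(data.table)
    J <- rJava::J    # user-level alias; wins over both packages
    # J() now reaches Java regardless of the load order above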
Re: [Rd] conflict between rJava and data.table
On 01.03.2013 16:13, Simon Urbanek wrote: On Mar 1, 2013, at 8:03 AM, Matthew Dowle wrote: [...]

For packages there is really no issue - if something breaks in XLConnect then the authors are probably importing the wrong function in their namespace (I still didn't see a reproducible example, though). The only difference is for interactive use, so not having a conflicting J() [if possible] would actually be useful there, since J() in rJava is primarily intended for interactive use.

Yes, that's what I wrote above, isn't it? i.e. it's users using J() in their own code, IIUC. J is "the high-level access to Java". Not just interactive use (i.e. at the R prompt) but inside their functions and scripts, too. Although, I don't know the rJava package at all, so why J() might be used for interactive use but not in functions and scripts isn't clear to me.

Any use of J from example(J) will serve as a reproducible example; e.g.,

    library(rJava)        # load rJava first
    library(data.table)   # then data.table
    J("java.lang.Double")

There is no error or warning, but the user is returned a 1-row, 1-column data.table rather than something related to Java. The errors/warnings follow from there. The user can either load the packages the other way around, or use :: :

    library(rJava)        # load rJava first
    library(data.table)   # then data.table
    rJava::J("java.lang.Double")   # ok now

Cheers, Simon
Re: [Rd] conflict between rJava and data.table
On 01.03.2013 20:19, Simon Urbanek wrote: On Mar 1, 2013, at 11:40 AM, Matthew Dowle wrote: [...]

Matt, there are two entirely separate uses: a) interactive use, b) use in packages. You are describing a), and as I said in the latter part above, J() in rJava is meant for that, so it would be useful to not have a conflict there.

Yes, (a) is the problem. Good, so I did the right thing in July 2012 by starting to deprecate J in data.table when this problem was first reported.

However, in the first part of my e-mail I was referring to b), where there is no conflict, because packages define which package a symbol will come from, so the user search path plays no role. Today, all packages should be using imports so search path pollution should no longer be an issue, so the order in which the user attached packages to their search path won't affect the functionality of the packages (that's why namespaces are mandatory). Therefore, if XLConnect breaks (again, I don't know, I didn't see it) due to the order on the search path, it indicates there is a bug in its namespace as it's apparently importing the wrong J - it should be importing it from rJava and not data.table. Is that more clear?

Yes, thanks. (b) isn't a problem. rJava and XLConnect aren't breaking; the users aren't reporting that. It's merely problem (a), e.g. where end users of both rJava and data.table use J() in their own code.

Cheers, Simon
Re: [Rd] Implications of a Dependency on a GPLed Package
Christian,

In my mind, rightly or wrongly, it boils down to these four points:

1. CRAN policy excludes closed-source packages; i.e., every single package on CRAN includes its C code, if any. If an R package included a .dll or .so which linked at C level to R, and that was being distributed without providing the source, then that would be a clear breach of R's GPL. But nobody is aware of any such package. Anyone who is aware of one should let the R Foundation know. Whether or not the GPL applies to R-only interpreted code (by definition you cannot close-source interpreted code) is important too, but not as important as distribution of closed-source binaries linking to R at C level.

2. Court cases would never happen unless two lawyers disagreed. Even then, two judges can disagree (otherwise appeals would never be successful).

3. There are two presidents of the R Foundation, and it appears they disagree. Therefore it appears very unlikely that the R Foundation would bring a GPL case against anyone. Rather, it seems to be up to the community to decide for themselves. If you don't mind closed-source, non-free software linking to R at C level then buy it (if that exists); if you do mind, don't.

4. As a package author it is entirely up to you how to approach this area. Yes, seek legal advice. And I'd suggest seeking the advice of several lawyers, not just one. Then follow the advice that you like the best.

Matthew
Re: [Rd] Bounty on Error Checking
On Fri, Jan 3, 2013, Bert Gunter wrote:

Well... On Thu, Jan 3, 2013 at 10:00 AM, ivo welch ivo.welch at anderson.ucla.edu wrote: Dear R developers---I just spent half a day debugging an R program, which had two bugs---I selected the wrongly named variable, which turns out to have been a scalar, which then happily multiplied as if it was a matrix; and another wrongly named variable from a data frame, that triggered no error when used as a[[name]] or a$name. There should be an option to turn on that throws an error inside R when one does this. I cannot imagine that there is much code that wants to reference non-existing columns in data frames.

But I can -- and do it all the time: To add a new variable, "d", to a data frame, df, containing only "a" and "b" (with 10 rows, say): df[["d"]] <- 1:10

Yes, but that's `[[<-`. Ivo was talking about `[[` and `$`; i.e., select only, not assign, if I understood correctly.

Trying to outguess documentation to create error triggers is a very bad idea.

Why exactly is it a very bad idea? (I don't necessarily disagree, just asking for more colour.)

R already has plenty of debugging tools -- and there is even a debug package. Perhaps you need a better programming editor/IDE. There are several listed on CRAN, RStudio, etc.

True, but that relies on you knowing there's a bug to hunt for. What if you don't know you're getting incorrect results, silently? In a similar way that options(warn=2) turns known warnings into errors, to enable you to be more strict if you wish, an option to turn on warnings from `[[` and `$` if the column is missing (select only, not assign) doesn't seem like a bad option to have. Maybe it would reveal some previously silent bugs.

Anyway, I'm hoping Ivo will let us know if he likes the simple mask I proposed, or not. That's already an option that can be turned on or off. But if his bug was selecting the wrong column, not a missing one, then I'm not sure anything could (or needs to be) done about that.

Matthew
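For concreteness, the options(warn=2) behaviour referred to -- any warning becomes a halting error:

    options(warn = 2)
    as.integer("abc")   # 'NAs introduced by coercion' now stops with an error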
Re: [Rd] Bounty on Error Checking
On 04.01.2013 14:03, Duncan Murdoch wrote: On 13-01-04 8:32 AM, Matthew Dowle wrote: [...]

In a similar way that options(warn=2) turns known warnings into errors, to enable you to be more strict if you wish, ...

I would say the point of options(warn=2) is rather to let you find the location of the warning more easily, because it will abort the evaluation.

True, but as well as that, I sometimes like to run production systems with options(warn=2). I'd prefer some tasks to halt at the slightest hint of trouble than write a warning silently to a log file that may not be looked at. I think of that as being more strict, more robust, since options(warn=2) is set even when there is no warning, to catch one if it arises in future -- not just to find it more easily once you know there is a warning.

I would not recommend using code that issues warnings.

Not sure what you mean here.

... an option to turn on warnings from `[[` and `$` if the column is missing (select only, not assign) doesn't seem like a bad option to have. Maybe it would reveal some previously silent bugs.

I agree that this would sometimes be useful, but a very common convention is to do something like if (is.null(obj$element)) { do something }. These would all have to be re-written to something like if (missing.field(obj, "element")) { do something }. There are several hundred examples of the first usage in base R; I imagine thousands more in contributed packages.

Yes, but Ivo doesn't seem to be writing that if() in his code. We're only talking about an option that users can turn on for their own code, IIUC -- not anything that would affect or break thousands of packages. That's why I referred to the fact that all packages now have namespaces in the earlier post.

I don't think the benefit of the change is worth all the work that would be necessary to implement it.

It doesn't seem to be a lot of work. I already posted a working straw man, for example, as a first step.

Matthew
Re: [Rd] Bounty on Error Checking
On 04.01.2013 14:56, Duncan Murdoch wrote: On 04/01/2013 9:51 AM, Matthew Dowle wrote: [...]

I just meant that I consider warnings to be a problem (as you do), so they should all be fixed.

I see now, good.

[...]

I understood the proposal to be that evaluating obj$element would issue a warning if element didn't exist. If that were the case, then the common test is.null(obj$element) would issue a warning in the cases where it now returns TRUE.

Yes, but only for obj$element appearing in Ivo's own code -- not if a package does that (including base). That's why I thought masking `[[` and `$` in .GlobalEnv might achieve that without affecting packages or base, although I don't know how such an option could be made available by R. Maybe options(strictselect=TRUE) would create those masks in .GlobalEnv, and options(strictselect=FALSE) would remove them. A package maintainer might choose to set that in their package to make it stricter (which would create those masks in the package's namespace too). Or users could just create those masks themselves, since it's only a few lines, without affecting packages or base.

Matthew
Re: [Rd] Bounty on Error Checking
On 04.01.2013 15:22, Duncan Murdoch wrote: On 04/01/2013 10:15 AM, Matthew Dowle wrote: [...]

options() are global.

I realise that. I was thinking that inside the options() function it could see if strictselect was being changed and then create the masks in .GlobalEnv. But I can see that is ugly.
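The "working straw man" itself isn't reproduced in this archive; a minimal sketch of the idea (the method body below is this sketch's assumption, not the original code) -- a select-only mask in .GlobalEnv that warns on missing columns, leaving assignment and package code alone:

    # Hypothetical mask: warn when $ reads a column that doesn't exist.
    `$.data.frame` <- function(x, name) {
      # name arrives as a character string in S3 methods for $
      if (!name %in% names(x))
        warning("column '", name, "' not found", call. = FALSE)
      .subset2(x, name)   # default extraction, bypassing this mask
    }

    df <- data.frame(a = 1:2, b = 3:4)
    df$d                  # still NULL, but now with a warning
    rm(`$.data.frame`)    # remove the mask to restore the default silence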
Re: [Rd] How to ensure -O3 on Win64
On 28.12.2012 00:41, Simon Urbanek wrote: On Dec 27, 2012, at 6:08 PM, Matthew Dowle wrote: On 27.12.2012 17:53, Simon Urbanek wrote: On Dec 23, 2012, at 9:22 PM, Matthew Dowle wrote:

Hi, Similar questions have come up before on the list and elsewhere but I haven't found a solution yet. winbuilder's install.out shows data.table's .c files compiled with -O3 on Win32 but -O2 on Win64. The same happens on R-Forge. I gather that some packages don't work with -O3 so the default is -O2. I've tried this in data.table's Makevars (entire contents):

    MAKEFLAGS=CFLAGS=-O3                   # added
    CFLAGS=-O3                             # added
    PKG_CFLAGS=-O3                         # added
    all: $(SHLIB)                          # no change
        mv $(SHLIB) datatable$(SHLIB_EXT)  # no change

but -O2 still appears in winbuilder's install.out (after -O3, and I believe the last -O is the one that counts):

    gcc -m64 -ID:/RCompile/recent/R-2.15.2/include -DNDEBUG -Id:/Rcompile/CRANpkg/extralibs215/local215/include -O3 -O2 -Wall -std=gnu99 -mtune=core2 -c dogroups.c -o dogroups.o

How can I ensure that data.table is compiled with -O3 on Win64?

You can't - at least not in a way that doesn't circumvent the R build system. Also it's not portable, so you don't want to mess with optimization flags and hard-code it in your package, as it's the user's choice how they set up R and its flags. You can certainly set up your R to compile with -O3; you just can't impose that on others. Cheers, Simon

Thanks Simon. This makes complete sense where users compile packages on install (Unix and Mac, and I'd better check my settings then), but Windows, where it's more common for the user to install the pre-compiled .zip from CRAN, is my concern. This came up because the new fread function in data.table wasn't showing as much of a speedup on Win64 as on Linux. I'm not 100% sure that non -O3 is the cause, but there are some function calls which get iterated a lot (e.g. isspace) and I'd seen that inlining was something -O3 did and -O2 did not. In general, why wouldn't a user of a package want the best performance from -O3?

Because it doesn't work? I don't know; you said yourself that -O2 may be there since -O3 breaks - that was not the question, though. (If you are curious about that, ask on CRAN; I don't remember the answer -- note that Win64 compiler support is relatively recent.)

Indeed, I had forgotten how recent that was. Ok, this is clicking now.

By non-portable do you mean the executable produced by winbuilder (or by CRAN) might not run on all Windows machines it's installed on (because -O3 (over-)optimizes for the machine it's built on), or do you mean that -O3 itself might not be available on some compilers (and if so, which compilers don't have -O3?).

Non-portable as in -O3 may not be supported or may break (we have seen -O3 trigger bugs in gcc before). If you hard-code it, there is no way around it. The point is that you cannot make decisions for the user in advance, because you don't know the setup the user may use. I agree that Windows is a bit of a special case in that there are very few choices so the risk of breaking things is lower, but if -O2 is really such a big deal, it is not just your problem and so you may want to investigate it further.

Ok, thanks a lot for the info. I'll try a few more things and follow up off r-devel if need be.

Matthew
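The user-level route Simon describes -- setting up your own R to compile with -O3 -- is normally a personal Makevars file rather than anything shipped in a package; a sketch (file location per Writing R Extensions):

    ## ~/.R/Makevars (or ~/.R/Makevars.win on Windows): applies when *you*
    ## install packages from source; it cannot be imposed on other users.
    CFLAGS = -O3 -Wall -std=gnu99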
Re: [Rd] read.csv reads more rows than indicated by wc -l
Ben,

Somewhere on my wish/TO DO list is for someone to rewrite read.table for better robustness *and* efficiency ...

Wish granted. New in data.table 1.8.7:

=====
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all auto detected.
* integers > 2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns:
    read.csv("test.csv")                                      # 30-60 sec
    read.table("test.csv", all known tricks and known nrows)  # 10 sec
    fread("test.csv")                                         # 3 sec
* airline data: 658MB csv (7 million rows x 29 columns):
    read.table("2008.csv", all known tricks and known nrows)  # 360 sec
    fread("2008.csv")                                         # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, discussions and beta testing.
=====

The help page ?fread is fairly well developed: https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable

Comments, feedback and bug reports very welcome.

Matthew
http://datatable.r-forge.r-project.org/
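A quick taste of the "A,B\n1,2\n3,4" form mentioned above -- the string itself is treated as the input, no file needed:

    library(data.table)
    DT <- fread("A,B\n1,2\n3,4")   # header, sep and types auto detected
    DT
    #    A B
    # 1: 1 2
    # 2: 3 4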
Re: [Rd] built-in NAMED(obj) from within R
Benjamin Tyner btyner at gmail.com writes:

Hello, Is it possible to retrieve the 'named' field within the header (sxpinfo) of an object, without resorting to a debugger, external code, etc?

And much more than just NAMED:

    .Internal(inspect(x))

The goal is to ascertain whether a copy of an object has been made.

Then:

    ?tracemem

One demonstration of using both together is here: http://stackoverflow.com/a/10312843/403310

Matthew
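Putting the two together -- an assumed session; the exact header format varies by R version, and tracemem() requires a build with memory profiling (the default for CRAN binaries):

    x <- runif(5)
    .Internal(inspect(x))   # header line includes the NAMED field
    tracemem(x)             # start reporting copies of x
    y <- x                  # binding alone: no copy yet
    y[1] <- 0               # the modification: tracemem reports the copy here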
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
On Sun, Nov 4, 2012 at 6:35 AM, Justin Talbot jtal...@stanford.edu wrote:

Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix. The case for pprod is similar, plus colProds doesn't exist.

Right, and consistency; for what that's worth.

Thus, + should have the signature `+`(..., na.rm=FALSE), which would allow you to do things like: `+`(c(1,2),c(1,2),c(1,2),NA, na.rm=TRUE) = c(3,6). If you don't like typing `+`, you could always alias psum to `+`.

But there would be a cost, wouldn't there? `+` is a dyadic .Primitive. Changing that to take `...` and `na.rm` could slow it down (IIUC), and any changes to the existing language are risky. For example, `+`(1,2,3) is currently an error. ...

There would be a very slight performance cost for the current interpreter. For the new bytecode compiler, though, there would be no performance cost since the common binary form can be detected at compile time and an optimized bytecode can be emitted for it. Taking what's currently an error and making it legal is a pretty safe change; unless someone is currently relying on `+`(1,2,3) to return an error, which I doubt. I think the bigger question on making this change work would be on the S3 dispatch logic. I don't understand the intricacies of S3 well enough to know if this change is plausible or not.

Interesting. Sounds more possible than I thought.

In contrast, adding two functions that didn't exist before, psum and pprod, seems to be a safer and simpler proposition.

Definitely easier. Leaves the language a bit more complicated, but that might be the right trade-off. I would strongly suggest adding pany and pall as well; I find myself wishing for them all the time. prange would be nice as well.

Have a look at the matrixStats package; it might bring what you're looking for: http://cran.r-project.org/web/packages/matrixStats

/Henrik

Nice package and very handy. It has colProds, too. But its functions take a matrix. "Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix."

Matthew
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
Justin Talbot jtalbot at stanford.edu writes:

Because that's inconsistent with pmin and pmax when two NAs are summed.

    x = c(1,3,NA,NA,5)
    y = c(2,NA,4,NA,1)
    colSums(rbind(x, y), na.rm = TRUE)
    [1] 3 3 4 0 6    # actual
    [1] 3 3 4 NA 6   # desired

But your desired result would be inconsistent with sum:

    sum(NA,NA,na.rm=TRUE)
    [1] 0

From a language definition perspective I think having psum return 0 here is the right choice.

Ok, you've sold me. psum(NA,NA,na.rm=TRUE) returning 0 sounds good. And pprod(NA,NA,na.rm=TRUE) returning 1, consistent with prod, then. Then the case for psum is more for convenience and speed -vs- colSums(rbind(x,y), na.rm=TRUE), since rbind will copy x and y into a new matrix. The case for pprod is similar, plus colProds doesn't exist.

Thus, + should have the signature `+`(..., na.rm=FALSE), which would allow you to do things like: `+`(c(1,2),c(1,2),c(1,2),NA, na.rm=TRUE) = c(3,6). If you don't like typing `+`, you could always alias psum to `+`.

But there would be a cost, wouldn't there? `+` is a dyadic .Primitive. Changing that to take `...` and `na.rm` could slow it down (IIUC), and any changes to the existing language are risky. For example, `+`(1,2,3) is currently an error. Changing that to do something might have implications for some of the 4,000 packages (some might rely on that being an error), with a possible speed cost too. In contrast, adding two functions that didn't exist before, psum and pprod, seems to be a safer and simpler proposition.

Matthew
[Rd] There is pmin and pmax each taking na.rm, how about psum?
Hi,

Please consider the following :

    > x = c(1,3,NA,5)
    > y = c(2,NA,4,1)
    > min(x, y, na.rm=TRUE)    # ok
    [1] 1
    > max(x, y, na.rm=TRUE)    # ok
    [1] 5
    > sum(x, y, na.rm=TRUE)    # ok
    [1] 16
    > pmin(x, y, na.rm=TRUE)   # ok
    [1] 1 3 4 1
    > pmax(x, y, na.rm=TRUE)   # ok
    [1] 2 3 4 5
    > psum(x, y, na.rm=TRUE)
    [1] 3 3 4 6                # expected result
    Error: could not find function "psum"   # actual result

I realise that + is already like psum, but what about NA?

    > x+y
    [1] 3 NA NA 6   # can't supply `na.rm=TRUE` to `+`

Is there a case to add psum? Or have I missed something.

This question survived when I asked on Stack Overflow : http://stackoverflow.com/questions/13123638/there-is-pmin-and-pmax-each-taking-na-rm-why-no-psum

And a search of the archives found that Gabor has suggested it too, as an aside : http://r.789695.n4.nabble.com/How-to-do-it-without-for-loops-tp794745p794750.html

If someone from R core is willing to sponsor the idea, I am willing to write, test and submit the code for psum, implemented in a very similar fashion to pmin and pmax. Or perhaps it exists already in a package somewhere (I searched but didn't find it).

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] There is pmin and pmax each taking na.rm, how about psum?
Because that's inconsistent with pmin and pmax when two NAs are summed:

    > x = c(1,3,NA,NA,5)
    > y = c(2,NA,4,NA,1)
    > colSums(rbind(x, y), na.rm = TRUE)
    [1] 3 3 4 0 6    # actual
    [1] 3 3 4 NA 6   # desired

and it would be less convenient/natural (and slower) than a psum which would call .Internal(psum(na.rm, ...)) in the same way as pmin and pmax.

Why don't you make a matrix and use colSums or rowSums?

    x = c(1,3,NA,5)
    y = c(2,NA,4,1)
    colSums(rbind(x, y), na.rm = TRUE)

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
thierry.onkel...@inbo.be
www.inbo.be

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Possible page inefficiency in do_matrix in array.c
Actually, my apologies, I was assuming that your example was based on the SO question while it is not at all (the code is not involved in that test case). Reversing the order does indeed cause a delay. Switching to a single index doesn't seem to have any impact. R-devel has the faster version now (which now also works with large vectors).

Cheers, Simon

I was intrigued why the compiler doesn't swap the loops when you thought it should, though. You're not usually wrong! From GCC's documentation (the end of the last paragraph is the most significant) :

-floop-interchange
    Perform loop interchange transformations on loops. Interchanging two
    nested loops switches the inner and outer loops. For example, given a
    loop like:

        DO J = 1, M
          DO I = 1, N
            A(J, I) = A(J, I) * C
          ENDDO
        ENDDO

    loop interchange transforms the loop as if it were written:

        DO I = 1, N
          DO J = 1, M
            A(J, I) = A(J, I) * C
          ENDDO
        ENDDO

    which can be beneficial when N is larger than the caches, because in
    Fortran, the elements of an array are stored in memory contiguously by
    column, and the original loop iterates over rows, potentially creating
    at each access a cache miss. This optimization applies to all the
    languages supported by GCC and is not limited to Fortran. To use this
    code transformation, GCC has to be configured with --with-ppl and
    --with-cloog to enable the Graphite loop transformation infrastructure.

Could R build scripts be configured to set these gcc flags to turn on Graphite, then? I guess one downside could be the time to compile.

Matthew

On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:

On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:

In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
        for (i = 0; i < nr; i++)
            for (j = 0; j < nc; j++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
        for (j = 0; j < nc; j++)
            for (i = 0; i < nr; i++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

or more simply :

    case LGLSXP :
        for (i = 0; i < nc*nr; i++)
            LOGICAL(ans)[i] = NA_LOGICAL;

(with some fine tuning required since NR is type R_xlen_t whilst i, nc and nr are type int). Same goes for all the other types in that switch. This came up on Stack Overflow here : http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r

That is completely irrelevant - modern compilers will optimize the loops accordingly and there is no difference in speed. If you don't believe it, run benchmarks ;)

original

    > microbenchmark(matrix(nrow=1, ncol=), times=10)
    Unit: milliseconds
                            expr      min       lq  median       uq      max
    1 matrix(nrow = 1, ncol = ) 940.5519 940.6644 941.136 954.7196 1409.901

swapped

    > microbenchmark(matrix(nrow=1, ncol=), times=10)
    Unit: milliseconds
                            expr      min       lq   median      uq      max
    1 matrix(nrow = 1, ncol = ) 949.9638 950.6642 952.7497 961.001 1246.573

Cheers, Simon

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
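The matrix sizes in the benchmark calls above are not visible in the archived text; a runnable equivalent, with dimensions chosen purely for illustration:

    library(microbenchmark)
    microbenchmark(matrix(nrow = 5000, ncol = 5000), times = 10)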
[Rd] Possible page inefficiency in do_matrix in array.c
In do_matrix in src/array.c there is a type switch containing :

    case LGLSXP :
        for (i = 0; i < nr; i++)
            for (j = 0; j < nc; j++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

That seems page inefficient, iiuc. Think it should be :

    case LGLSXP :
        for (j = 0; j < nc; j++)
            for (i = 0; i < nr; i++)
                LOGICAL(ans)[i + j * NR] = NA_LOGICAL;

or more simply :

    case LGLSXP :
        for (i = 0; i < nc*nr; i++)
            LOGICAL(ans)[i] = NA_LOGICAL;

(with some fine tuning required since NR is type R_xlen_t whilst i, nc and nr are type int). Same goes for all the other types in that switch. This came up on Stack Overflow here : http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Non ascii character on Mac on CRAN (C locale)
Dear all,

A recent bug fix for data.table was for non-ascii characters in column names and grouping by those columns. So, the package's test file now includes non-ascii characters to test that bug fix :

    # Test non ascii characters when passed as character by, #2134
    x = rep(LETTERS[1:2], 3)
    y = rep(1:3, each=2)
    DT = data.table(ÅR=x, foo=y)
    test(708, names(DT[, mean(foo), by="ÅR"]), c("ÅR","V1"))
    test(709, DT[, mean(foo), by="ÅR"], DT[, mean(foo), by=ÅR])
    DT = data.table(FÅR=x, foo=y)
    test(710, names(DT[, mean(foo), by="FÅR"]), c("FÅR","V1"))
    DT = data.table(ÆØÅ=x, foo=y)
    test(711, DT[, mean(foo), by="ÆØÅ"], data.table(ÆØÅ=c("A","B"), V1=2))
    test(712, DT[, mean(foo), by=ÆØÅ], data.table(ÆØÅ=c("A","B"), V1=2))

This passes R CMD check on Linux, Windows and Mac on R-Forge, but not on Mac on CRAN, because Prof Ripley advises that it uses the C locale. It works on Windows because data.table does this first :

    oldenc = options(encoding="UTF-8")[[1L]]
    sys.source("tests.R")   # the file that includes the tests above
    options(encoding=oldenc)

If I change it to the following, will it work on CRAN's Mac, and is this ok/correct? Since it passes on R-Forge's Mac, I can't think how else to test this.

    oldlocale = Sys.getlocale("LC_CTYPE")
    if (oldlocale=="C") Sys.setlocale("LC_CTYPE","en_GB.UTF-8")
    oldenc = options(encoding="UTF-8")[[1L]]
    sys.source("tests.R")   # the file that includes the tests above
    options(encoding=oldenc)
    Sys.setlocale("LC_CTYPE",oldlocale)

Many thanks, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Understanding tracemem
Hadley Wickham hadley at rice.edu writes:

Why does x[5] <- 5 create a copy

That assigns 5, not 5L. x is being coerced from integer to double. x[5] <- 5L doesn't copy.

, when x[11] (which should be extending a vector) does not? I can understand that maybe x[5] <- 5 hasn't yet been optimised to not make a copy, but if that's the case then why doesn't x[11] <- 11 make one?

Extending a vector is creating a new (longer) vector and copying the old (shorter) one in. That's different to duplicate(). tracemem only reports calls to duplicate().

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
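As a runnable sketch of the behaviour described in the reply (tracemem requires R built with memory profiling, as the CRAN binaries are; the copy/no-copy pattern below restates the explanation above and exact output varies by R version):

    x <- 1:10      # integer vector
    tracemem(x)
    x[5] <- 6L     # integer into integer: no duplicate(), tracemem silent
    x[5] <- 5      # coerces integer to double: tracemem reports a copy
    x[11] <- 11    # extends the vector: a new allocation, not duplicate(),
                   # so tracemem is silent again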
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Matthew Dowle mdowle at mdowle.plus.com writes: Will check R-Forge again when it catches up. Thanks. Matthew Just to confirm, R-Forge has today caught up and is now using R r59554 which includes the fix for the problem in this thread. Its binary build of data.table is now installing fine on R 2.15.0 release, which it wasn't doing before. Many thanks, Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
On Tue, 2012-06-12 at 20:38 -0400, Simon Urbanek wrote:

Something like

    all: $(SHLIB)
    	mv $(SHLIB) datatable$(SHLIB_EXT)

should do the trick (resist the temptation to create a datatable$(SHLIB_EXT) target - it doesn't work due to the makefile loading sequence, unfortunately). AFAIR you don't need to mess with install.libs because the default is to install all shlibs in the directory.

Cheers, Simon

Huge thank you, Simon. Works perfectly. +100!

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
Matthew Dowle mdowle at mdowle.plus.com writes:

On Tue, 2012-06-12 at 20:38 -0400, Simon Urbanek wrote:

Something like

    all: $(SHLIB)
    	mv $(SHLIB) datatable$(SHLIB_EXT)

should do the trick (resist the temptation to create a datatable$(SHLIB_EXT) target - it doesn't work due to the makefile loading sequence, unfortunately). AFAIR you don't need to mess with install.libs because the default is to install all shlibs in the directory.

Cheers, Simon

Huge thank you, Simon. Works perfectly. +100!

Matthew

I guess the 'mv' command works on Mac, too. For Windows I think I need to create pkg/src/Makevars.win with 'mv' replaced by 'rename'. Is that right?

    all: $(SHLIB)
    	rename $(SHLIB) datatable$(SHLIB_EXT)

I could try that and submit to winbuilder and see, but asking here as well in case there's anything else to consider for Windows.

Thanks again, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
On 12-06-13 4:45 AM, Matthew Dowle wrote:

I guess the 'mv' command works on Mac, too. For Windows I think I need to create pkg/src/Makevars.win with 'mv' replaced by 'rename'. Is that right?

    all: $(SHLIB)
    	rename $(SHLIB) datatable$(SHLIB_EXT)

I could try that and submit to winbuilder and see, but asking here as well in case there's anything else to consider for Windows.

mv should be fine on Windows. If you have a makefile, you have Rtools installed, and mv is in Rtools.

Duncan Murdoch

Neat. Glad I asked, thanks.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] How to change name of .so/.dll
Hi,

I've added R_init_data_table to the data.table package (which has a dot in its name). This works well in R 2.15.0, because of this from the Writing R Extensions manual :

"Note that there are some implicit restrictions on this mechanism as the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain '.'): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. As from R 2.15.0, if entry point R_init_lib is not found it is also looked for with '.' replaced by '_'."

But how do I confine the DLL name, is it an option in Makevars? The name of the shared object is currently data.table.so (data.table.dll on Windows). Is it possible to change the file name to datatable.so (and datatable.dll) in a portable way so that R_init_datatable works (without a dot), and, without Depend-ing on R >= 2.15.0 and without changing the name of the package?

Thanks, Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to change name of .so/.dll
Matthew Dowle wrote :

Hi, I've added R_init_data_table to the data.table package (which has a dot in its name). This works well in R 2.15.0, because of this from the Writing R Extensions manual :

"Note that there are some implicit restrictions on this mechanism as the basename of the DLL needs to be both a valid file name and valid as part of a C entry point (e.g. it cannot contain '.'): for portable code it is best to confine DLL names to be ASCII alphanumeric plus underscore. As from R 2.15.0, if entry point R_init_lib is not found it is also looked for with '.' replaced by '_'."

But how do I confine the DLL name, is it an option in Makevars? The name of the shared object is currently data.table.so (data.table.dll on Windows). Is it possible to change the file name to datatable.so (and datatable.dll) in a portable way so that R_init_datatable works (without a dot), and, without Depend-ing on R >= 2.15.0 and without changing the name of the package?

Just to clarify, I'm aware R CMD SHLIB has the -o argument, which can be used to create datatable.so instead of data.table.so. It's R CMD INSTALL that's the problem, as that seems to pass -o pkg_name to R CMD SHLIB. I found install.libs.R (added to R in 2.13.1); could that be used to create datatable.so instead of data.table.so? Or a line I could add to pkg/src/Makevars?

Thanks! Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggest that as.double( something double ) not make a copy
Henrik Bengtsson hb at biostat.ucsf.edu writes: See also R-devel '[Rd] Suggestion for memory optimization and as.double() with friends', March 28-29 2007 [https://stat.ethz.ch/pipermail/r-devel/2007-March/045109.html]. /Henrik Interesting thread. So we have you to thank for instigating that 5 years ago: thanks! Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Prof Ripley wrote : That Depends line is about source installs. I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
On 07/06/2012 11:40, Matthew Dowle wrote:

Prof Ripley wrote : That Depends line is about source installs.

I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac.

That field is documented under the description of a *source* package (see the first line of section 1.1, and it is in that section) and is simply copied from the source package for binary installs. It is the extra line added to the DESCRIPTION file, e.g.

    Built: R 2.15.0; x86_64-pc-mingw32; 2012-04-02 09:27:07 UTC; windows

that tells you the version a binary package was built under (approximately for R-patched and R-devel), and library() checks.

I'm fairly sure I understand all that. I'm still missing something more basic, probably. Consider the following workflow : I look on CRAN at package boot. Its webpage states Depends: R (>= 2.14.0). I'm a user running R and I know I use 2.14.1, so I think great, I can use it. I install it as follows.

    > version
      version.string  R version 2.14.1 (2011-12-22)
    > install.packages("boot")
    trying URL 'http://cran.ma.imperial.ac.uk/bin/windows/contrib/2.14/boot_1.3-4.zip'
    Content type 'application/zip' length 469615 bytes (458 Kb)
    opened URL
    downloaded 458 Kb
    package 'boot' successfully unpacked and MD5 sums checked
    > require(boot)
    Loading required package: boot
    Warning message:
    package 'boot' was built under R version 2.14.2

Does this mean that CRAN maintainers expect me to run the latest version of the major release I'm using (R 2.14.2 in this case), not the current release of R (R 2.15.0 currently) as you wrote earlier? If that's the case I never realised it before, but that seems very reasonable. When I ran the above just now I expected it to say package 'boot' was built under R version 2.15.0. But it didn't, it said 2.14.2. So it seems to be my misunderstanding.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
On 07/06/2012 12:49, Matthew Dowle wrote:

On 07/06/2012 11:40, Matthew Dowle wrote:

Prof Ripley wrote : That Depends line is about source installs.

I can't see that documented in either Writing R Extensions or ?install.packages. Is it somewhere else? I thought Depends applied to binaries from CRAN too, which is the default method on Windows and Mac.

That field is documented under the description of a *source* package (see the first line of section 1.1, and it is in that section) and is simply copied from the source package for binary installs. It is the extra line added to the DESCRIPTION file, e.g.

    Built: R 2.15.0; x86_64-pc-mingw32; 2012-04-02 09:27:07 UTC; windows

that tells you the version a binary package was built under (approximately for R-patched and R-devel), and library() checks.

I'm fairly sure I understand all that. I'm still missing something more basic, probably. Consider the following workflow : I look on CRAN at package boot. Its webpage states Depends: R (>= 2.14.0). I'm a user running R and I know I use 2.14.1, so I think great, I can use it. I install it as follows.

    > version
      version.string  R version 2.14.1 (2011-12-22)
    > install.packages("boot")
    trying URL 'http://cran.ma.imperial.ac.uk/bin/windows/contrib/2.14/boot_1.3-4.zip'
    Content type 'application/zip' length 469615 bytes (458 Kb)
    opened URL
    downloaded 458 Kb
    package 'boot' successfully unpacked and MD5 sums checked
    > require(boot)
    Loading required package: boot
    Warning message:
    package 'boot' was built under R version 2.14.2

Does this mean that CRAN maintainers expect me to run the latest version of the major release I'm using (R 2.14.2 in this case), not the current release of R (R 2.15.0 currently) as you wrote earlier? If that's the case I never realised it before, but that seems very reasonable. When I ran the above just now I expected it to say package 'boot' was built under R version 2.15.0. But it didn't, it said 2.14.2. So it seems to be my misunderstanding.

2.15.x and 2.14.x are different series, with different binary repos.

Thanks. So CRAN will continue to build and check new versions of packages using R 2.14.2 in the 2.14.x repo, whilst R 2.15.x progresses separately. I'm familiar with r-oldrel results on the CRAN package check results page, but for some reason I had missed the nuance that there's a binary repo too for r-oldrel. That's great.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
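The per-series binary repositories can be seen from R itself; for illustration (any CRAN mirror URL works the same way, and the commented results are what contrib.url composes from the running R version):

    > contrib.url("http://cran.r-project.org", type = "win.binary")
    # under R 2.14.x: "http://cran.r-project.org/bin/windows/contrib/2.14"
    # under R 2.15.x: "http://cran.r-project.org/bin/windows/contrib/2.15"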
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
I built R-trunk (rev 59537), ran 'Rtrunk CMD build data.table', installed the resulting tar.gz into R release and it ran tests ok. So it seems ok now, if that tested it right. Will check R-Forge again when it catches up. Thanks.

Matthew

On Wed, 2012-06-06 at 22:04 +0200, peter dalgaard wrote:

FYI, Brian has backed out the changes to identical() in r59533 of R-patched. Please retry your test codes with the new version. (Due to some ISP mess-up, Brian is temporarily unable to reply in detail himself.)

-pd

On Jun 6, 2012, at 20:29, luke-tier...@uiowa.edu wrote:

On Wed, 6 Jun 2012, Matthew Dowle wrote:

Dan Tenenbaum dtenenba at fhcrc.org writes:

I know this has come up before on R-help (http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html) but I have a concise reproducible case that I wanted to share. Also, please note the Bioconductor scenario which is potentially seriously impacted by this. The issue arises when a binary version of a package (like my example package below) is built under R 2.15.0 Patched but then installed under R 2.15.0. Our package AnnotationDbi (which hundreds of other packages depend on) is impacted by this issue to the extent that calling virtually any function in it will return something like this:

    Error in ls(2) : 7 arguments passed to .Internal(identical) which requires 6

My concern is that when R 2.15.1 is released and Bioconductor starts building all its packages under it, that R 2.15.0 users will start to experience this problem. We can ask all users to upgrade to R 2.15.1 if we have to, but it's not usually the case that a minor point release MUST be installed in order to run packages built under it (please correct me if I'm wrong). We would much prefer a workaround or fix to make an upgrade unnecessary.

I'm seeing the same issue. Installing the latest R-Forge .zip of data.table built using 2.15.0 patched, on R 2.15.0 (or 2.14.1, same issue), then running data.table(a=1:3) produces the "7 arguments passed to .Internal(identical) which requires 6" error. traceback() and debugger() just display the top level call. debug(data.table) and stepping through reveals it is a call to identical() but just a regular one. No .Internal() call in the package, let alone passing 6 or 7 arguments to .Internal. Not sure how else to debug or trace it. R-Forge is byte compiling data.table using R 2.15.0 patched (iiuc); would that make a difference when the byte code is loaded into 2.15.0 which doesn't have the new argument in identical()?

Matthew

Yes it would.

luke

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                Phone: 319-335-3386
Department of Statistics and      Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                email: luke-tier...@uiowa.edu
Iowa City, IA 52242               WWW: http://www.stat.uiowa.edu

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 7 arguments passed to .Internal(identical) which requires 6
Dan Tenenbaum dtenenba at fhcrc.org writes:

I know this has come up before on R-help (http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html) but I have a concise reproducible case that I wanted to share. Also, please note the Bioconductor scenario which is potentially seriously impacted by this. The issue arises when a binary version of a package (like my example package below) is built under R 2.15.0 Patched but then installed under R 2.15.0. Our package AnnotationDbi (which hundreds of other packages depend on) is impacted by this issue to the extent that calling virtually any function in it will return something like this:

    Error in ls(2) : 7 arguments passed to .Internal(identical) which requires 6

My concern is that when R 2.15.1 is released and Bioconductor starts building all its packages under it, that R 2.15.0 users will start to experience this problem. We can ask all users to upgrade to R 2.15.1 if we have to, but it's not usually the case that a minor point release MUST be installed in order to run packages built under it (please correct me if I'm wrong). We would much prefer a workaround or fix to make an upgrade unnecessary.

I'm seeing the same issue. Installing the latest R-Forge .zip of data.table built using 2.15.0 patched, on R 2.15.0 (or 2.14.1, same issue), then running data.table(a=1:3) produces the "7 arguments passed to .Internal(identical) which requires 6" error. traceback() and debugger() just display the top level call. debug(data.table) and stepping through reveals it is a call to identical() but just a regular one. No .Internal() call in the package, let alone passing 6 or 7 arguments to .Internal. Not sure how else to debug or trace it. R-Forge is byte compiling data.table using R 2.15.0 patched (iiuc); would that make a difference when the byte code is loaded into 2.15.0 which doesn't have the new argument in identical()?

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggest that as.double( something double ) not make a copy
Tim Hesterberg timhesterberg at gmail.com writes:

I've been playing with passing arguments to .C(), and found that replacing as.double(x) with

    if(is.double(x)) x else as.double(x)

saves time and avoids one copy, in the case that x is already double. I suggest modifying as.double to avoid the extra copy and just return x, when x is already double. Similarly for as.integer, etc.

But as.double() already doesn't copy if its argument is already double. Unless your double has attributes? From coerce.c :

    if(TYPEOF(x) == type) {
        if(ATTRIB(x) == R_NilValue) return x;
        ans = NAMED(x) ? duplicate(x) : x;
        CLEAR_ATTRIB(ans);
        return ans;
    }

quick test :

    > x = 1
    > .Internal(inspect(x))
    @03E23620 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
    > .Internal(inspect(as.double(x)))   # no copy
    @03E23620 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
    > x = c(foo=1)   # give x some attributes, say names
    > x
    foo
      1
    > .Internal(inspect(x))
    @03E234D0 14 REALSXP g0c1 [NAM(1),ATT] (len=1, tl=0) 1
    ATTRIB:
      @03D54910 02 LISTSXP g0c0 []
        TAG: @00380088 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
        @03E234A0 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
          @03E23560 09 CHARSXP g0c1 [gp=0x21] "foo"
    > .Internal(inspect(as.double(x)))   # strips attribs, returning a new object
    @03E233B0 14 REALSXP g0c1 [] (len=1, tl=0) 1
    > as.double(x)
    [1] 1

Attribute stripping is documented in ?as.double.

Rather than as.double() on the R side, you could use coerceVector() on the C side, which might be easier to use via .Call than .C since it takes an SEXP.

Looking at coerceVector in coerce.c, its first line returns immediately if type is already the desired type, with no attribute stripping, so that seems like the way to go? If your double has no attributes then I'm barking up the wrong tree.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
Duncan Murdoch murdoch.duncan at gmail.com writes:

On 12-05-23 4:37 AM, Matthew Dowle wrote:

Hi, I've read ?is.unsorted and searched. Have found a few items but nothing close, yet. Is the following expected?

    > is.unsorted(data.frame(1:2))
    [1] FALSE
    > is.unsorted(data.frame(2:1))
    [1] FALSE
    > is.unsorted(data.frame(1:2,3:4))
    [1] TRUE
    > is.unsorted(data.frame(2:1,4:3))
    [1] TRUE

IIUC, is.unsorted is intended for atomic vectors only (description of x in ?is.unsorted). Indeed the C source (src/main/sort.c) contains an error message "only atomic vectors can be tested to be sorted". So that is the error message I expected to see in all cases above, since I know that data.frame is not an atomic vector. But there is also this in ?is.unsorted: "except for atomic vectors and objects with a class (where the >= or > method is used)", which I don't understand. Where is >= or > used, by what, and where?

If you look at the source, you will see that the basic test for classed objects is all(x[-1L] >= x[-length(x)]) (in the function base:::.gtn). This comparison doesn't really make sense for dataframes, but it does seem to be backwards: that tests that x[2] >= x[1], x[3] >= x[2], etc., returning TRUE if all comparisons are TRUE: but that sounds like it should be is.sorted(), not is.unsorted(). Or is it my brain that is backwards?

Thanks. Yes, you're right. So is.unsorted() on a data.frame is trying to tell us if there exists any unsorted row, it seems.

    > DF = data.frame(a=c(1,3,5), b=c(1,3,5))
    > DF
      a b
    1 1 1    # this row is sorted
    2 3 3    # this row is sorted
    3 5 5    # this row is sorted
    > is.unsorted(DF)    # going by row, but should be !.gtn
    [1] TRUE
    > with(DF, is.unsorted(order(a,b)))    # most people's natural expectation, I guess
    [1] FALSE
    > DF[2,2] = 2
    > DF
      a b
    1 1 1    # this row is sorted
    2 3 2    # this row isn't sorted
    3 5 5    # this row is sorted
    > is.unsorted(DF)    # going by row, but should be !.gtn
    [1] FALSE
    > with(DF, is.unsorted(order(a,b)))    # most people's natural expectation, I guess
    [1] FALSE

Since it seems to have a bug anyway (and if so, can't be correct in anyone's use of it), could is.unsorted on a data.frame either return the error that's in the C code already, "only atomic vectors can be tested to be sorted", for safety and to lessen confusion, or be changed to return the natural expectation proposed above? The easiest quick fix would be to negate the result of the .gtn call of course, but then you could never go back.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
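The column-wise behaviour called the "natural expectation" above generalises to any number of columns in a few lines of R; a sketch (a hypothetical helper, not base R, and relying on order() being stable):

    is_unsorted_df <- function(DF) {
        # TRUE unless the rows are already ordered by column 1,
        # with ties broken by column 2, and so on
        is.unsorted(do.call(order, unname(as.list(DF))))
    }
    is_unsorted_df(data.frame(a=c(1,3,5), b=c(1,3,5)))   # FALSE
    is_unsorted_df(data.frame(a=c(3,1,5), b=c(1,3,5)))   # TRUE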
Re: [Rd] Expected behaviour of is.unsorted?
Duncan Murdoch murdoch.duncan at gmail.com writes:

On 12-05-24 7:39 AM, Matthew Dowle wrote:

Since it seems to have a bug anyway (and if so, can't be correct in anyone's use of it), could is.unsorted on a data.frame either return the error that's in the C code already, "only atomic vectors can be tested to be sorted", for safety and to lessen confusion, or be changed to return the natural expectation proposed above? The easiest quick fix would be to negate the result of the .gtn call of course, but then you could never go back.

I don't follow the last sentence. If the .gtn call needs to be negated, why would you want to go back?

Because then is.unsorted(DF) would work, but go by row, which you guessed above wasn't intended and isn't sensible. But once it worked in that way, users might start to depend on it; e.g., by writing is.unsorted(t(DF)). If I came along in future and suggested that was inefficient, and wouldn't it be more natural and efficient if is.unsorted(DF) went by column, returning the same as with(DF, is.unsorted(order(a,b))) but implemented efficiently, you would fear that user code now depended on it going by row and say it was too late. I'd persist and highlight that it didn't seem in keeping with the spirit of is.unsorted()'s speed, since it short circuits on the first unsorted item, which is why we love it. You'd reply that's not documented. Which it isn't. And that would be the end of that.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
On 24/05/2012 9:15 AM, Matthew Dowle wrote:

Because then is.unsorted(DF) would work, but go by row, which you guessed above wasn't intended and isn't sensible. But once it worked in that way, users might start to depend on it; e.g., by writing is.unsorted(t(DF)). If I came along in future and suggested that was inefficient, and wouldn't it be more natural and efficient if is.unsorted(DF) went by column, returning the same as with(DF, is.unsorted(order(a,b))) but implemented efficiently, you would fear that user code now depended on it going by row and say it was too late. I'd persist and highlight that it didn't seem in keeping with the spirit of is.unsorted()'s speed, since it short circuits on the first unsorted item, which is why we love it. You'd reply that's not documented. Which it isn't. And that would be the end of that.

Okay, I'm going to fix the handling of .gtn results, and document the unsuitability of this function for dataframes and arrays.

But that leaves the door open to confusion later, whilst closing the door to a better solution: making is.unsorted() work by column for data.frame; i.e., making is.unsorted _suitable_ for data.frame. If you just do the quick fix for the .gtn result you can never go back. If making is.unsorted(DF) work by column is too hard for now, then leaving the door open would be better, by returning the error message already in the C code: "only atomic vectors can be tested to be sorted". That would be a better quick fix since it leaves options for the future.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Expected behaviour of is.unsorted?
On 24/05/2012 11:10 AM, Matthew Dowle wrote:

Okay, I'm going to fix the handling of .gtn results, and document the unsuitability of this function for dataframes and arrays.

But that leaves the door open to confusion later, whilst closing the door to a better solution: making is.unsorted() work by column for data.frame; i.e., making is.unsorted _suitable_ for data.frame. If you just do the quick fix for the .gtn result you can never go back. If making is.unsorted(DF) work by column is too hard for now, then leaving the door open would be better, by returning the error message already in the C code: "only atomic vectors can be tested to be sorted". That would be a better quick fix since it leaves options for the future.

I don't see why saying this function is unsuitable for dataframes implies that it will never be made suitable for dataframes.

If user code or packages start to depend on is.unsorted(t(DF)), it would be harder to change, no? Why provide something that is unsuitable and allow that possibility to happen? It's more user friendly to return "not implemented" or "unsuitable", or the nicer message already in the C code, than leave the door open for confusion and errors. Or in other words, it's even more user friendly to return a warning or error to the user at the prompt, than the user friendliness of writing in the help file that it's unsuitable for data.frame.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Expected behaviour of is.unsorted?
Hi,

I've read ?is.unsorted and searched. Have found a few items but nothing close, yet. Is the following expected?

    > is.unsorted(data.frame(1:2))
    [1] FALSE
    > is.unsorted(data.frame(2:1))
    [1] FALSE
    > is.unsorted(data.frame(1:2,3:4))
    [1] TRUE
    > is.unsorted(data.frame(2:1,4:3))
    [1] TRUE

IIUC, is.unsorted is intended for atomic vectors only (description of x in ?is.unsorted). Indeed the C source (src/main/sort.c) contains an error message "only atomic vectors can be tested to be sorted". So that is the error message I expected to see in all cases above, since I know that data.frame is not an atomic vector. But there is also this in ?is.unsorted: "except for atomic vectors and objects with a class (where the >= or > method is used)", which I don't understand. Where is >= or > used, by what, and where?

I understand why the first two are FALSE (1 item of anything must be sorted). I don't understand the 3rd and 4th cases where length is 2: do_isunsorted seems to call lang3(install(".gtn"), x, CADR(args)). Does that fall back to TRUE for some reason?

Matthew

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-pc-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
    [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United Kingdom.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] data.table_1.8.0

    loaded via a namespace (and not attached):
    [1] tools_2.15.0

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] test suites for packages
Uwe Ligges ligges at statistik.tu-dortmund.de writes: On 17.05.2012 16:52, Brian G. Peterson wrote: On Thu, 2012-05-17 at 16:32 +0200, Uwe Ligges wrote: Yes: R CMD check does the trick. See Writing R Extension and read about a package's test directory. I prefer frameworks that do not obfuscate failing test results on the CRAN check farm (as most other frameworks I have seen). Uwe: I don't think that's completely fair. RUnit and testthat tests can be configured to be called from the R package tests directory, so that they are run during R CMD check. They don't *need* to be configured that way, so perhaps that's what you're talking about. I am talking about the problem that relevant output of test failures that may help to identify the problem is frequently not shown in the output of R CMD check when such frameworks are used - that is a major nuisance for CRAN automatisms. Not sure, but could it be that in some cases the output of test failures is there, but chopped off since CRAN displays the 13 line tail? At least that's what I've experienced, and reported, and asked to be increased in the past. Often the first error causes a cascade, so it's the head you need to see, not the tail. If I've got that right, how about a much larger limit than 13, say 1000. Or the first 50 and last 50 lines of output. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
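One pragmatic pattern for a package's tests/ script, so that the informative part survives a truncated tail: collect failures and report them together at the end. A sketch only; check() here is a hypothetical helper, not part of any existing test framework:

    failures <- character()
    check <- function(label, expr) {
        # record the label if the expression errors or is not TRUE
        ok <- isTRUE(tryCatch(expr, error = function(e) FALSE))
        if (!ok) failures <<- c(failures, label)
        invisible(ok)
    }
    check("arithmetic", identical(1 + 1, 2))
    check("paste",      identical(paste("a", "b"), "a b"))
    if (length(failures))
        stop("tests failed: ", paste(failures, collapse = ", "))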
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Antonio Piccolboni antonio at piccolboni.info writes:

Hi, I was wondering if there is anything more efficient than split to do the kind of conversion in the subject. If I create a data frame as in

    > system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x", 1:2000, sep=""))})
       user  system elapsed
      0.004   0.000   0.004

and then I try to split it

    > system.time(split(fd, 1:nrow(fd)))
       user  system elapsed
      0.333   0.031   0.415

You will be quick to notice the roughly two orders of magnitude difference in time between creation and conversion. Granted, it's not written anywhere that they should be similar, but the latter seems interpreter-slow to me (split is implemented with a lapply in the data frame case). There is also a memory issue when I hit about 2 elements (allocating 3GB when interrupted). So before I resort to Rcpp, despite the electrifying feeling of approaching the bare metal and for the sake of getting things done, I thought I would ask the experts.

Thanks

Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, before r-devel. If you did, please say so. Answering anyway.

Do you really want to split every single row? What's the bigger picture? Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying some (biased) guesswork, have you seen the data.table package? It doesn't use the split-apply-combine paradigm because, as your (extreme) example shows, that doesn't scale. When you use the 'by' argument of [.data.table, it allocates memory once for the largest group. Then it reuses that same memory for each group. That's one reason it's fast and memory efficient at grouping (an order of magnitude faster than tapply). Independent timings : http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then DT[, something, by=1:nrow(DT)] will give perhaps two orders of magnitude speedup, but that's an unfair example because it isn't very realistic. Scaling applies to the size of the data.frame, and how much you want to split it up. Your example is extreme in the latter but not the former. data.table scales in both.

It's nothing to do with the interpreter, btw, just memory usage.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
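For concreteness, the grouping idiom referred to above, on toy data (sizes and column names chosen purely for illustration; by="id" syntax as in data.table 1.8.x):

    library(data.table)
    DT <- data.table(x = 1:2000, y = rnorm(2000), id = rep(1:100, each = 20))
    DT[, mean(y), by = "id"]   # grouped aggregation, no physical split()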
Re: [Rd] Byte compilation of packages on CRAN
On 11/04/2012 20:36, Matthew Dowle wrote:

In DESCRIPTION if I set LazyLoad to 'yes' will data.table (for example) then be byte compiled for users who install the binary package from CRAN on Windows?

No. LazyLoad is distinct from byte compilation. All installed packages use lazy loading these days (for simplicity: a very few do not benefit from it as they use all their objects at startup).

This question is based on reading section 1.2 of this document : http://www.divms.uiowa.edu/~luke/R/compiler/compiler.pdf I've searched r-devel and Stack Overflow history and have found questions and answers relating to R CMD INSTALL and install.packages() from source, but no answer (as yet) about why binary packages for Windows appear not to be byte compiled. If so, is there any reason why all packages should not set LazyLoad to 'yes'. And if not, could LazyLoad be 'yes' by default?

I wonder why you are not reading R's own documentation. 'Writing R Extensions' says:

"The `LazyData' logical field controls whether the R datasets use lazy-loading. A `LazyLoad' field was used in versions prior to 2.14.0, but now is ignored. The `ByteCompile' logical field controls if the package code is byte-compiled on installation: the default is currently not to, so this may be useful for a package known to benefit particularly from byte-compilation (which can take quite a long time and increases the installed size of the package)."

Oops, somehow missed that. Thank you!

Note that the majority of CRAN packages benefit very little from byte-compilation because almost all the time of their computations is spent in compiled code. And the increased size also may matter when the code is loaded into R.

Thanks, Matthew

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,     Tel: +44 1865 272861 (self)
1 South Parks Road,            +44 1865 272866 (PA)
Oxford OX1 3TG, UK        Fax: +44 1865 272595

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Byte compilation of packages on CRAN
In DESCRIPTION if I set LazyLoad to 'yes' will data.table (for example) then be byte compiled for users who install the binary package from CRAN on Windows? This question is based on reading section 1.2 of this document : http://www.divms.uiowa.edu/~luke/R/compiler/compiler.pdf I've searched r-devel and Stack Overflow history and have found questions and answers relating to R CMD INSTALL and install.packages() from source, but no answer (as yet) about why binary packages for Windows appear not to be byte compiled. If so, is there any reason why all packages should not set LazyLoad to 'yes'. And if not, could LazyLoad be 'yes' by default? Thanks, Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] CRAN policies
Mark.Bravington at csiro.au writes:

There must be over 2000 people who have written CRAN packages by now; every extra check and non-back-compatible additional requirement runs the risk of generating false negatives and incurring many extra person-hours to fix non-problems. Plus someone needs to document and explain the check (adding to the rule mountain), plus there is the time spent in discussions like this..!

Not sure where you're coming from on that. For example, Prof Ripley has added quite a few new NOTEs to QC.R over the last few months. These caught things I wasn't aware of in the two packages I maintain, and I was more than happy to fix them. It improves quality, surely. There's only one particular NOTE causing an issue: 'no visible binding'. If it were made a MEMO, we could move on. All the other NOTEs can (and should) be fixed, can't they?

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] CRAN policies
William Dunlap wdunlap at tibco.com writes:

-Original Message-

The survival package has a similar special case: the routines for expected population survival are set up to accept multiple types of date format, so have lines like

    if (class(x) == 'chron') { y <- as.numeric(x - chron("01/01/1960")) }

This leaves me with two extraneous "no visible binding" messages.

Suppose we defined a function like

    NO_VISIBLE_BINDING <- function(expr) expr

and added an entry to the stuff in codetools so that it would not check for misspelled object names in calls to NO_VISIBLE_BINDING. Then Terry could write that line as

    if (class(x) == "chron") { y <- as.numeric(x - NO_VISIBLE_BINDING(chron)("01/01/1960")) }

and the Notes would disappear.

That's ok for package code, but what about test suites? Say there was a test on the result of with(DF, a+b); you wouldn't want to change the test to with(DF, NO_VISIBLE_BINDING(a) + NO_VISIBLE_BINDING(b)), not just because that's long and onerous, but because that's *changing* the test, i.e. introducing a difference between what's tested and what user code will do.

As others suggested, how about a new category: MEMO. The "no visible binding" NOTE would be downgraded to MEMO. CRAN maintainers could then ignore MEMOs more easily.

What I really like about NOTES is that when new checks are added to R, then as a package maintainer you know you don't have to fix them straight away. If a new WARNING shows up on r-devel daily checks, however, then you've got some warning about the WARNING that you need to fix more urgently, and it may even accelerate a release. So it's not just about checks when submitting a package, but what happens afterwards as R itself (and packages in Depends) move on. In other words, you know you need to fix new NOTES but not as urgently as new WARNINGS.

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
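The "no visible binding" NOTE comes from the codetools package, so it can be reproduced interactively without running R CMD check; a sketch (the chron package is deliberately not attached here, so the name cannot be resolved):

    library(codetools)
    f <- function(x)
        if (class(x) == "chron") as.numeric(x - chron("01/01/1960"))
    checkUsage(f)
    # f: no visible global function definition for 'chron'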
Re: [Rd] merge bug fix in R 2.15.0
Anyone? Is it intended that the first suffix can no longer be blank? Seems to be caused by a bug fix to merge in R 2.15.0.

    $ Rdevel --vanilla
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
    Error in merge.data.frame(DF1, DF2, by = "a", suffixes = c("", ".1")) :
      there is already a column named 'b'

    $ R --vanilla
    R version 2.14.2 (2012-02-29)
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
      a b b.1
    1 1 4   7
    2 2 5   8
    3 3 6   9

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
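Until or unless that is changed, one workaround is a throwaway non-blank first suffix that is stripped afterwards; a sketch (".x" is an arbitrary choice, not part of merge's API):

    m <- merge(DF1, DF2, by = "a", suffixes = c(".x", ".1"))
    names(m) <- sub("\\.x$", "", names(m))   # drop the temporary suffix
    m
    #   a b b.1
    # 1 1 4   7
    # 2 2 5   8
    # 3 3 6   9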
[Rd] merge bug fix in R 2.15.0
Is it intended that the first suffix can no longer be blank? Seems to be caused by a bug fix to merge in R 2.15.0.

    $ Rdevel --vanilla
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
    Error in merge.data.frame(DF1, DF2, by = "a", suffixes = c("", ".1")) :
      there is already a column named 'b'

    $ R --vanilla
    R version 2.14.2 (2012-02-29)
    > DF1 = data.frame(a=1:3, b=4:6)
    > DF2 = data.frame(a=1:3, b=7:9)
    > merge(DF1, DF2, by="a", suffixes=c("",".1"))
      a b b.1
    1 1 4   7
    2 2 5   8
    3 3 6   9

Matthew

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] 111 FIXMEs in main/src
Hi,

We sometimes see offers to contribute, asking what needs to be done. If they know C, how about the 111 FIXMEs? But which ones would be most useful to fix? Which are difficult and which are easy? Does R-core have a process to list and prioritise the FIXMEs?

~/R/Rtrunk/src/main$ grep "[^/]FIXME" * | wc -l
111
~/R/Rtrunk/src/main$ grep -A 1 "[^/]FIXME" *
arithmetic.c:/* FIXME: consider using
arithmetic.c-    tmp = (long double)x1 - floor(q) * (long double)x2;
--
arithmetic.c:/* FIXME: with the y == 2.0 test now at the top that case isn't
arithmetic.c-   reached here, but i have left it for someone who understands the
--
arithmetic.c:/* FIXME: Danger Will Robinson.
arithmetic.c- * - We might be trashing arguments here.
--
array.c:/* FIXME: the following is desirable, but pointless as long as
array.c-   subset.c & others have a contrary version that leaves the
--
attrib.c:/* FIXME: 1.e-5 should rather be == option('ts.eps') !! */
attrib.c-    if (fabs(end - start - (n - 1)/frequency) > 1.e-5)
--
attrib.c:    /* FIXME : The whole classgets may as well die. */
attrib.c-
--
attrib.c:/* FIXME */
attrib.c-    if (nvalues <= 0)
--
attrib.c:/* FIXME */
attrib.c-    PROTECT(namesattr);
--
attrib.c:/* FIXME: the code below treats pair-based structures */
attrib.c-/* in a special way. This can probably be dropped down */
--
base.c:/* FIXME: Make this a macro to avoid function call overhead?
base.c-   Inline it if you really think it matters.
--
bind.c:/* FIXME : is there another possibility? */
bind.c-
--
bind.c:    /* FIXME: I'm not sure what the author intended when the sequence was
bind.c-       defined as raw < logical -- it is possible to represent logical as
--
builtin.c:    /* FIXME -- Rstrlen allows for double-width chars */
builtin.c-    width += Rstrlen(STRING_ELT(labs, nlines % lablen), 0) + 1;
--
builtin.c:/* FIXME: call EncodeElement() for every element of s.
builtin.c-
--
builtin.c:    /* FIXME : cat(...) should handle ANYTHING */
builtin.c-    w = strlen(p);
--
character.c:    slen = strlen(ss); /* FIXME -- should handle embedded nuls */
character.c-    buf = R_AllocStringBuffer(slen+1, cbuff);
--
character.c:   FIXME: could prefer UTF-8 here
character.c- */
--
character.c:/* FIXME: could use R_Realloc instead */
character.c-    cbuf = CallocCharBuf(strlen(tmp) + 1);
--
character.c:/* FIXME use this buffer for new string as well */
character.c-    wc = (wchar_t *)
--
coerce.c:/* FIXME: Use
coerce.c- =
--
complex.c:/* FIXME: maybe add full IEC60559 support */
complex.c-static double complex clog(double complex x)
--
complex.c:/* FIXME: check/add full IEC60559 support */
complex.c-static double complex cexp(double complex x)
--
connections.c:/* FIXME: is this correct for consoles? */
connections.c-    checkArity(op, args);
--
connections.c:/* FIXME: could do any MBCS locale, but would need pushback */
connections.c-static SEXP
--
connections.c:    outlen = 1.01 * inlen + 600; /* FIXME, copied from bzip2 */
connections.c-    buf = (unsigned char *) R_alloc(outlen, sizeof(unsigned char));
--
datetime.c:    /* FIXME some of this should be outside the loop */
datetime.c-    int ns, nused = 4;
--
dcf.c:    /* FIXME:
dcf.c-       Why are we doing this?
--
debug.c:/* FIXME: previous will have 0x whereas other values are
debug.c-   without the */
--
deriv.c:/* FIXME: simplify exp(lgamma( E )) = gamma( E ) */
deriv.c-    ans = lang2(ExpSymbol, arg1);
--
deriv.c:/* FIXME: simplify log(gamma( E )) = lgamma( E ) */
deriv.c-    ans = lang2(LogSymbol, arg1);
--
deriv.c:/* FIXME */
deriv.c-#ifdef NOTYET
--
devices.c:/* FIXME Disable this for now */
devices.c-/*
--
devices.c:/* FIXME: There should really be a formal graphics finaliser
devices.c- * but this is a good proxy for now.
--
devices.c:/* FIXME: there should be a way for a device to declare its own
devices.c-   events, and return information on how to set them */
--
dounzip.c:       filename is in UTF-8, so FIXME */
dounzip.c-    SET_STRING_ELT(names, i, mkChar(filename_inzip));
--
duplicate.c:   FIXME: surely memcpy would be faster here?
duplicate.c-*/
--
engine.c:/* FIXME: what about clipping? (if the device can't)
engine.c-*/
--
engine.c:/* FIXME: what about clipping? (if the device can't)
engine.c- * Maybe not too bad because it is just a matter of shaving off
--
engine.c:    /* FIXME: This assumes that wchar_t is UCS-2/4,
engine.c-       since that is what GEMetricInfo expects */
--
engine.c:/* FIXME: should we warn on more than one character here? */
engine.c-int GEstring_to_pch(SEXP pch)
--
envir.c:   FIXME ? should this also
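For anyone wanting to reproduce the count without leaving R, here is a rough sketch; it assumes (as above) that the svn trunk is checked out at ~/R/Rtrunk, and it only scans the .c files, so the count may differ slightly from the shell grep over *:

    # Count FIXMEs across src/main from within R
    files <- list.files("~/R/Rtrunk/src/main", pattern = "\\.c$", full.names = TRUE)
    hits <- lapply(files, function(f)
        grep("[^/]FIXME", readLines(f, warn = FALSE), value = TRUE))
    names(hits) <- basename(files)
    sum(sapply(hits, length))   # ~111 at the time of writing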
[Rd] Identical copy of base function
Hello,

Regarding this in R-devel/NEWS/New features :

  o library(pkg) no longer warns about a conflict with a function from
    package:base if the function is an identical copy of the base one
    but with a different environment.

Why would one want an identical copy in a different environment? I'm thinking I may be missing out on a trick here.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
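For concreteness, a minimal sketch of the situation the NEWS item describes, using base::rev purely as a stand-in (the ignore.environment argument to identical() needs a reasonably recent R):

    # A function byte-for-byte identical to the base one, except for its
    # enclosing environment:
    f <- base::rev
    environment(f) <- new.env(parent = baseenv())
    identical(f, base::rev)                              # FALSE: environments differ
    identical(f, base::rev, ignore.environment = TRUE)   # TRUE: the copy library() now tolerates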
[Rd] names<- appears to copy 3 times?
Hi,

$ R --vanilla
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
DF = data.frame(a=1:3, b=4:6)
DF
  a b
1 1 4
2 2 5
3 3 6
tracemem(DF)
[1] "<0x8898098>"
names(DF)[2] = "B"
tracemem[0x8898098 -> 0x8763e18]:
tracemem[0x8763e18 -> 0x8766be8]:
tracemem[0x8766be8 -> 0x8766b68]:
DF
  a B
1 1 4
2 2 5
3 3 6

Are those 3 copies really taking place?

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
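For context, R-lang 3.4.4 defines this replacement as expanding through a temporary, which is one place the extra references (and hence copies) can come from; a sketch of the expansion:

    # names(DF)[2] <- "B" is defined to expand to roughly:
    `*tmp*` <- DF
    DF <- `names<-`(`*tmp*`, value = `[<-`(names(`*tmp*`), 2, value = "B"))
    rm(`*tmp*`)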
[Rd] Confused about NAMED
Hi,

I expected NAMED to be 1 in all these three cases. It is for one of them, but not the other two?

R --vanilla
R version 2.14.0 (2011-10-31)
Platform: i386-pc-mingw32/i386 (32-bit)

x = 1L
.Internal(inspect(x))   # why NAM(2)? expected NAM(1)
@2514aa0 13 INTSXP g0c1 [NAM(2)] (len=1, tl=0) 1

y = 1:10
.Internal(inspect(y))   # NAM(1) as expected but why different to x?
@272f788 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,2,3,4,5,...

z = data.frame()
.Internal(inspect(z))   # why NAM(2)? expected NAM(1)
@24fc28c 19 VECSXP g0c0 [OBJ,NAM(2),ATT] (len=0, tl=0)
ATTRIB:
  @24fc270 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @24fc334 16 STRSXP g0c0 [] (len=0, tl=0)
    TAG: @3f2040 01 SYMSXP g0c0 [MARK,gp=0x4000] "row.names"
    @24fc318 13 INTSXP g0c0 [] (len=0, tl=0)
    TAG: @3f2388 01 SYMSXP g0c0 [MARK,gp=0x4000] "class"
    @25be500 16 STRSXP g0c1 [] (len=1, tl=0)
      @1d38af0 09 CHARSXP g0c2 [MARK,gp=0x21,ATT] "data.frame"

It's a little difficult to search for the word 'named' but I tried and found this in R-ints :

  "Note that optimizing NAMED = 1 is only effective within a primitive (as the closure wrapper of a .Internal will set NAMED = 2 when the promise to the argument is evaluated)"

So might it be that just looking at NAMED using .Internal(inspect()) is setting NAMED=2? But if so, why does y have NAMED==1?

Thanks!
Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 11:13 , Matthew Dowle wrote:

[ snip ]

This is tricky business... I'm not quite sure I'll get it right, but let's try.

When you are assigning a constant, the value you assign is already part of the assignment expression, so if you want to modify it, you must duplicate. So NAMED==2 on z <- 1 is basically to prevent you from accidentally changing the value of 1. If it weren't, then you could get bitten by code like

    for(i in 1:2) {z <- 1; if(i==1) z[1] <- 2}

If you're assigning the result of a computation, then the object only exists once, so z <- 0+1 gets NAMED==1. However, if the computation is done by returning a named value from within a function, as in

    f <- function(){v <- 1+0; v}
    z <- f()

then again NAMED==2. This is because the side effects of the function _might_ result in something having a hold on the function environment, e.g. if we had

    e <- NULL
    f <- function(){e <<- environment(); v <- 1+0; v}
    z <- f()

then z[1] <- 5 would change e$v too. As it happens, there aren't any side effects in the former case, but R loses track and assumes the worst.

Thanks a lot, think I follow. That explains x vs y, but why is z NAMED==2? The result of data.frame() is an object that exists once (similar to 1:10) so shouldn't it be NAMED==1 too? Or, R loses track and assumes the worst even on its own functions such as data.frame()?

Thanks!
Matthew

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 12:34 , Matthew Dowle wrote:

[ snip ]

Or, R loses track and assumes the worst even on its own functions such as data.frame()?

R loses track. I suspect that is really all it can do without actual reference counting. The function data.frame is more than 150 lines of code, and if any of those end up invoking user code, possibly via a class method, you can't tell definitively whether or not the evaluation environment dies at the return.

Ohhh, think I see now. After Duncan's reply I was going to ask if it was possible to change data.frame() to be primitive so it could set NAMED=1. But it seems primitive functions can't use R code, so data.frame() would need to be ported to C. Ok! - not quick or easy, and not without considerable risk. And, data.frame() can invoke user code inside it anyway then.

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

DF = list(a=1:3, b=4:6)
.Internal(inspect(DF))   # so far so good: NAM(1)
@25149e0 19 VECSXP g0c1 [NAM(1),ATT] (len=2, tl=0)
  @263ea50 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @263eaa0 13 INTSXP g0c2 [] (len=3, tl=0) 4,5,6
ATTRIB:
  @2457984 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @25149c0 16 STRSXP g0c1 [] (len=2, tl=0)
      @1e987d8 09 CHARSXP g0c1 [MARK,gp=0x21] "a"
      @1e56948 09 CHARSXP g0c1 [MARK,gp=0x21] "b"

attr(DF,"foo") <- "bar"   # just adding an attribute sets NAM(2) ?
.Internal(inspect(DF))
@25149e0 19 VECSXP g0c1 [NAM(2),ATT] (len=2, tl=0)
  @263ea50 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @263eaa0 13 INTSXP g0c2 [] (len=3, tl=0) 4,5,6
ATTRIB:
  @2457984 02 LISTSXP g0c0 []
    TAG: @3f2120 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @25149c0 16 STRSXP g0c1 [] (len=2, tl=0)
      @1e987d8 09 CHARSXP g0c1 [MARK,gp=0x21] "a"
      @1e56948 09 CHARSXP g0c1 [MARK,gp=0x21] "b"
    TAG: @245732c 01 SYMSXP g0c0 [] "foo"
    @25148a0 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @2514920 09 CHARSXP g0c1 [gp=0x20] "bar"

Matthew

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 14:05 , Matthew Dowle wrote:

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

Yes. As soon as there is the slightest risk of having (had) two references to the same object, NAMED==2 and it is never reduced. While your mind is boggling, I might boggle it a bit more:

z <- 1:10
.Internal(inspect(z))
@116e11788 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,2,3,4,5,...
m <- mean(z)
.Internal(inspect(z))
@116e11788 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...

This happens because while mean() is running, there is a second reference to z, namely mean's argument x. (With languages like R, you have no insurance that there will be no changes to the global environment while a function call is being evaluated, so bugs can bite in both places -- z or x.) There are many of these cases where you might pragmatically want to override the default NAMED logic, but you'd be stepping into treacherous waters. Luke has probably been giving these matters quite some thought in connection with his compiler project.

Ok, very interesting. Think I'm there. Thanks for all the info.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Confused about NAMED
On Nov 24, 2011, at 8:05 AM, Matthew Dowle wrote:

[ snip ]

Since list() is primitive I tried to construct a data.frame starting with list() [since structure() isn't primitive], but then merely adding an attribute seems to set NAMED==2 too?

Yes, because attr(x,"y") <- z is the same as

`*tmp*` <- x
x <- `attr<-`(`*tmp*`, "y", z)
rm(`*tmp*`)

so there are two references to the data frame: one in DF and one in `*tmp*`. It is the first line that causes the NAMED bump. And, yes, it's real:

`f<-` = function(x,value) { print(ls(parent.frame())); x <- value }
x = 1
f(x) = 1
[1] "*tmp*" "f<-"   "x"

You could skip that by using the function directly (I don't think it's recommended, though):

.Internal(inspect(l <- list(a=1)))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
ATTRIB:
  @100b6e748 02 LISTSXP g0c0 []
    TAG: @100843878 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @1028c82c8 16 STRSXP g0c1 [] (len=1, tl=0)
      @1009cd388 09 CHARSXP g0c1 [MARK,gp=0x21] "a"

.Internal(inspect(`names<-`(l, "b")))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
ATTRIB:
  @100b6e748 02 LISTSXP g0c0 []
    TAG: @100843878 01 SYMSXP g0c0 [MARK,gp=0x4000] "names"
    @1028c8178 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @100967af8 09 CHARSXP g0c1 [MARK,gp=0x20] "b"

.Internal(inspect(l))
@1028c82f8 19 VECSXP g0c1 [NAM(1),ATT] (len=1, tl=0)
  @1028c8268 14 REALSXP g0c1 [] (len=1, tl=0) 1
Re: [Rd] Efficiency of factor objects
Stavros Macrakis macrakis at alum.mit.edu writes:

data.table certainly has some useful mechanisms, and I've been experimenting with it as an implementation mechanism, though it's not a drop-in substitute for factors. Also, though it is efficient for set operations between small sets and large sets, it is not very efficient for operations between two large sets

As a general statement that could do with some clarification ;)

data.table likes keys consisting of multiple ordered columns, e.g. (id, date). It is (I believe) efficient for joining two large 2+ column keyed data sets, because the upper bound of each row's one-sided binary search is localised in that case (by group of the previous key column).

As I understand it, Stavros has a different type of 'two large datasets' : English language website data. Each set is one large vector of uniformly distributed unique strings. That appears to be quite a different problem to multiple columns of many-times-duplicated data.

Matthew

Thanks everyone, and if you do come across a relevant CRAN package, I'd be very interested in hearing about it. -s
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
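To make the multi-column keyed case concrete, a small sketch of the kind of join meant above; the table names and data are invented for illustration:

    library(data.table)
    # Two tables keyed on (id, date); the join binary-searches per row of DT2,
    # with the search range localised within each id group.
    DT1 <- data.table(id = rep(1:3, each = 2), date = rep(1:2, 3), x = 1:6,
                      key = "id,date")
    DT2 <- data.table(id = c(2L, 3L), date = c(1L, 2L), key = "id,date")
    DT1[DT2]   # rows of DT1 matching each (id, date) pair in DT2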
Re: [Rd] Contributors on R-Forge
Milan Bouchet-Valat nalimi...@club.fr wrote in message news:1319202026.9174.6.camel@milan...

On Friday 21 October 2011 at 13:39 +0100, Charles Roosen wrote :

Hi, I've recently taken over maintenance for the xtable package, and have set it up on R-Forge. At the moment I'm pondering what the best way is to handle submitted patches. Basically, is it better to:

1) Be non-restrictive regarding committer status, let individuals change the code with minimal pre-commit review, and figure changes can be reviewed before release.
2) Accept patches and basically log them as issues to look at in detail before putting them in.

I'd say you'd better review patches before they go in, as it would be quite ugly to fix things afterwards, right before the release. If a patch is buggy, better catch problems early instead of waiting for changes to add up: then, it will be harder to find out the origin of the bug. It also allows you to spot small issues like styling and indentation, that you wouldn't bother to fix once they've been committed. You can give people committer status, but ask them to post their patches as issues before committing. This reduces the burden imposed on the reviewer/maintainer.

My view :

1) Yes, be non-restrictive but impose some ground rules :
   i) each commit should pass 'R CMD check'
  ii) each new feature or bug fix should have an associated test added to the test suite (run by R CMD check), and an item added to NEWS (by the committer).
 iii) all developers subscribe to the -commits list and review each commit in a timely manner when the unified diff arrives in your inbox. If something is wrong or forgotten, ask the committer to fix it there and then.

Matthew

Regards
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Possible to read R_StringHash from a package?
Is there any way to look at R_StringHash from a package? I've read R-ints 1.16.1 'Hiding C entry points' and seen that R_StringHash is declared as extern0 in Defn.h. So it seems the answer is no.

Thanks,
Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4e259600.5070...@gmail.com...

On 11-07-19 7:48 AM, Matthew Dowle wrote:

Prof Brian Ripley rip...@stats.ox.ac.uk wrote in message news:alpine.lfd.2.02.1107190640280.28...@gannet.stats.ox.ac.uk...

On Mon, 18 Jul 2011, Alireza Mahani wrote:

Simon, Thank you for elaborating on the limitations of R in handling float types. I think I'm pretty much there with you. As for the insufficiency of single-precision math (and hence limitations of GPU), my personal take so far has been that double precision becomes crucial when some sort of error accumulation occurs. For example, in differential equations where boundary values are integrated to arrive at interior values, etc. On the other hand, in my personal line of work (Hierarchical Bayesian models for quantitative marketing), we have so much inherent uncertainty and noise at so many levels in the problem (and no significant error-accumulation sources) that the single vs double precision issue is often inconsequential for us. So I think it really depends on the field as well as the nature of the problem.

The main reason to use only double precision in R was that on modern CPUs double precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more-or-less for free.

But isn't it rather less 'free' when large data sets are considered? If a double matrix takes 3GB, it's 1.5GB in single. That might alleviate the dreaded out-of-memory error for some users in some circumstances. On 64bit, 50GB reduces to 25GB, and that might make the difference between getting something done, or not. If single were appropriate, of course. For GPU too, i/o often dominates, iiuc. For space reasons, is there any possibility of R supporting single precision (and single-bit logical, to reduce memory for logicals by 32 times)? I guess there might be complaints from users using single inappropriately (or worse, not realising we have an unstable result due to single).

You can do any of this using external pointers now. That will remind you that every single function to operate on such objects needs to be rewritten. It's a huge amount of work, benefiting very few people. I don't think anyone in R Core will do it.

Duncan Murdoch

I've been informed off list about the 'bit' package, which seems great and answers my parenthetic complaint (at least). http://cran.r-project.org/web/packages/bit/index.html

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
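For what the 'bit' package gives, a minimal sketch (1 bit per element instead of the 32 bits R uses for each logical):

    library(bit)
    b <- bit(1e6)         # length-1e6 bit vector, all FALSE
    b[c(3, 5)] <- TRUE    # ordinary subassignment works
    sum(b)                # 2
    object.size(b)        # ~125 KB, vs ~4 MB for logical(1e6)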
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Prof Brian Ripley rip...@stats.ox.ac.uk wrote in message news:alpine.lfd.2.02.1107190640280.28...@gannet.stats.ox.ac.uk...

On Mon, 18 Jul 2011, Alireza Mahani wrote:

Simon, Thank you for elaborating on the limitations of R in handling float types. I think I'm pretty much there with you. As for the insufficiency of single-precision math (and hence limitations of GPU), my personal take so far has been that double precision becomes crucial when some sort of error accumulation occurs. For example, in differential equations where boundary values are integrated to arrive at interior values, etc. On the other hand, in my personal line of work (Hierarchical Bayesian models for quantitative marketing), we have so much inherent uncertainty and noise at so many levels in the problem (and no significant error-accumulation sources) that the single vs double precision issue is often inconsequential for us. So I think it really depends on the field as well as the nature of the problem.

The main reason to use only double precision in R was that on modern CPUs double precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more-or-less for free.

But isn't it rather less 'free' when large data sets are considered? If a double matrix takes 3GB, it's 1.5GB in single. That might alleviate the dreaded out-of-memory error for some users in some circumstances. On 64bit, 50GB reduces to 25GB, and that might make the difference between getting something done, or not. If single were appropriate, of course. For GPU too, i/o often dominates, iiuc. For space reasons, is there any possibility of R supporting single precision (and single-bit logical, to reduce memory for logicals by 32 times)? I guess there might be complaints from users using single inappropriately (or worse, not realising we have an unstable result due to single).

Matthew

You also under-estimate the extent to which stability of commonly used algorithms relies on double precision. (There are stable single-precision versions, but they are no longer commonly used. And as Simon said, in some cases stability is ensured by using extra precision where available.)

I disagree slightly with Simon on GPUs: I am told by local experts that the double precision on the latest GPUs (those from the last year or so) is perfectly usable. See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.

Regards, Alireza

--
View this message in context: http://r.789695.n4.nabble.com/Manipulating-single-precision-float-arrays-in-Call-functions-tp3675684p3677232.html
Sent from the R devel mailing list archive at Nabble.com.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK   Fax: +44 1865 272595
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
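The halving in the example above (3GB becoming 1.5GB) is simple byte arithmetic; a quick sketch:

    bytes_double <- 8; bytes_single <- 4
    n <- 3 * 2^30 / bytes_double    # elements in a 3 GB double matrix
    n * bytes_single / 2^30         # 1.5 (GB) for the same data in singles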
Re: [Rd] Manipulating single-precision (float) arrays in .Call functions
Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4e259600.5070...@gmail.com...

On 11-07-19 7:48 AM, Matthew Dowle wrote:

[ snip ]

You can do any of this using external pointers now. That will remind you that every single function to operate on such objects needs to be rewritten. It's a huge amount of work, benefiting very few people. I don't think anyone in R Core will do it.

Ok, thanks for the responses.

Matthew
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
Matthew,

I was hoping I misunderstood your first proposal, but I suspect I did not ;). Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to { V1 <- 3; DT[1, V1] }, thus returning the first element of the third column.

Please see FAQ 1.1, since further below it seems to be an expectation issue about 'with' syntax, too.

That said, I don't think it works, either. Taking your example and data.table from r-forge:

[ snip ]

as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce the non-working state I'll need some more environment information please.

Also I suspect there is something quite amiss because even trivial things don't work:

DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
DT[1:4,1:4]
[1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9 and 1.10.

When I first saw your proposal, I thought you had rather something like within(DT, V1[1] <- 3) in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of building 'within' in, too. I'll take a look at within(). Might as well provide as many options as possible to the user to use as they wish.

With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3) performs reasonably:

system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the global environment - it also interacts with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the original issue Ivo raised, then? If so, job done.

Cheers,
Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

[ snip ]
Re: [Rd] [datatable-help] speeding up perception
Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge. From NEWS :

o   Fast update is now implemented, FR#200. DT[i,j]<-value is now handled by data.table in C rather than falling through to data.frame methods. Thanks to Ivo Welch for raising speed issues on r-devel, to Simon Urbanek for the suggestion, and Luke Tierney and Simon for information on R internals. [<- syntax still incurs one working copy of the whole table (as of R 2.13.0) due to R's [<- dispatch mechanism copying to `*tmp*`, so, for ultimate speed and brevity, 'within' syntax is now available as follows.

o   A new 'within' argument has been added to [.data.table, by default TRUE. It is very similar to the within() function in base R. If an assignment appears in j, it assigns to the column of DT, by reference; e.g.,

        DT[i, colname <- value]

    This syntax makes no copies of any part of memory at all.

m = matrix(1, nrow=10, ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)

system.time(for (i in 1:1000) DF[1,1] <- 3)
   user  system elapsed
287.730 323.196 613.453

system.time(for (i in 1:1000) DT[1, V1 <- 3])
   user  system elapsed
  1.152   0.004   1.161     # 528 times faster

Please note :

    *****************************************************
    **  Within syntax is presently highly experimental. **
    *****************************************************

http://datatable.r-forge.r-project.org/

On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:

On Wed, 6 Jul 2011, Simon Urbanek wrote:

Interesting, and I stand corrected:

x = data.frame(a=1:n, b=1:n)
.Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[1,1]=42L
.Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=10, tl=0) 1,2,3,4,5,...
x[[1]][1]=42L
.Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=10, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=10, tl=0) 1,2,3,4,5,...

I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes so at some point ...

The internals assume in various places that deep copies are made (one of the reasons NAMED settings are not propagated to sub-structure). The main issues are avoiding cycles and that there is no easy way to check for sharing. There may be some circumstances in which a shallow copy would be OK but making sure it would be in all cases is probably more trouble than it is worth at this point. (I've tried this in the past in a few cases and always had to back off.)

Best,
luke

Cheers,
Simon

On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

[ snip ]
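To see the working copy of the whole table that the NEWS item above mentions (the `*tmp*` copy made by R's [<- dispatch), a small sketch with tracemem; this assumes an R build with memory profiling enabled (the default on Windows, --enable-memory-profiling elsewhere):

    n <- 100000
    x <- data.frame(a = 1:n, b = 1:n)
    tracemem(x)      # start reporting duplications of x
    x[1, 1] <- 42L   # each "tracemem[... -> ...]" line printed is a copy of x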
Re: [Rd] [datatable-help] speeding up perception
Simon,

If you didn't install.packages() with method=source from R-Forge, that would explain (some of) it. R-Forge builds binaries once each night. This commit was long after the cutoff.

Matthew

[ snip ]
Re: [Rd] Suggestions for R-devel / R-help digest format
Don't most people use a newsreader? For example, pointed to here :

gmane.comp.lang.r.general
gmane.comp.lang.r.devel

IIUC, NNTP downloads headers only; when you open any post it downloads the body at that point. So it's more efficient than email (assuming you don't open every single post). I guess RSS is similar/better. Newsreaders handle threading and you can watch/ignore threads easily.

Actually subscribing via email? The only reason I am subscribed is to post unmoderated (and to encourage Martin with +1 on his subscriber count); I have email delivery turned off in the mailman settings. Thought everyone did that!

If I counted correctly, there are 36 gmane mirrors for various packages and sigs. You can watch all these (including r-devel and r-help) via gmane without needing to subscribe on mailman at all.

Matthew

Saravanan saravanan.thirumuruganat...@gmail.com wrote in message news:4e160850.1040...@gmail.com...

Thanks Steve and Brian! Probably, I will create a gmail account for mailing lists and let it take care of the threading.

Regards,
Saravanan

On 07/07/2011 12:02 PM, Brian G. Peterson wrote:

On Thu, 2011-07-07 at 11:44 -0500, Saravanan wrote:

Hello, I am a passive reader of both the R-devel and R-help mailing lists. I am sending the following comments to r-devel as it seemed more suitable. I am aware that this list uses GNU mailman for list management. I have my options set so that it sends an email digest. One thing I find is that the digest consists of emails ordered temporally. For e.g., let's say there are two threads t1 and t2 and the emails arrive as e1 of t1, e2 of t2, e3 of t1. The digest lists them as e1, e2 and then e3. Is it possible to somehow configure it as T1: e1, e3 and then T2: e2? This is the digest format that google groups uses, which is incredibly helpful as you can read all the messages in a thread. Additionally, it also helpfully includes a header that lists all the threads in the digest so that you can jump to the one you are interested in. I checked the mailman options but could not find any. Does anyone else have the same issue? It is not a big issue on R-devel but R-help is a much higher traffic mailing list. I am interested in hearing how you read/filter your digest mails on either R-help or other high volume mailing lists.

This really has nothing to do with R, but rather mailman. I use folders, filtered on the server using SIEVE and/or procmail. No digest required. I get the mails immediately, not later in the day or the next day, and can use all my various email clients easily to read/respond. mailman supports a MIME digest format that includes a table of contents with links to each MIME part. mailman does not support a threaded digest, to the best of my knowledge.

Regards,
- Brian
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [datatable-help] speeding up perception
On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?

n = 1000
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
  0.628   0.000   0.628

n = 10
x = data.frame(a=1:n, b=1:n)   # still 2 columns, but longer columns
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
 20.145   1.232  21.455

With $<- :

n = 1000
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
  0.304   0.000   0.307

n = 10
x = data.frame(a=1:n, b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
 37.586   0.388  38.161

If it's because the 1st column needs to be copied (only) because that's the one being assigned to (in this test), that magnitude of slow down doesn't seem consistent with the time of a vector copy of the 1st column :

n = 10
v = 1:n
system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed
  0.016   0.000   0.017
system.time(for (i in 1:1000) {v2=v; v2[1] <- 42L})
   user  system elapsed
  1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is assigned to :

n = 10
x = data.frame(rep(list(1:n),100))
dim(x)
[1]  10 100
system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
167.974  50.903 219.711

Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius dwinsem...@comcast.net wrote:

On Jul 5, 2011, at 7:18 PM, luke-tier...@uiowa.edu wrote:

On Tue, 5 Jul 2011, Simon Urbanek wrote:

On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved. Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature, which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on. Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references, so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment seems to confirm this though I don't guarantee I did that right).

The issue is that a replacement function implemented as a closure, which is the only option for a package, will always see NAMED on the object to be modified as 2 (because the value is obtained by forcing the argument promise) and so any R level assignments will duplicate. This also isn't really an issue of imprecise reference counting -- there really are (at least) two legitimate references -- one through the argument and one through the caller's environment.

It would be good if we could come up with a way for packages to be able to define replacement functions that do not duplicate in cases where we really don't want them to, but this would require coming up with some sort of protocol, minimally involving an efficient way to detect whether a replacement function is being called in a replacement context or directly.

Would $<- always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.

There are some replacement functions that use C code to cheat, but these may create problems if called directly, so I won't advertise them.

Best,
luke

Cheers
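A tiny sketch of Luke's point about closures; `first<-` is an invented name, not an existing function:

    `first<-` <- function(x, value) {
        # x arrives via a forced promise, so NAMED(x) is 2 in here:
        # this assignment therefore duplicates x on every call.
        x[1] <- value
        x
    }
    y <- 1:10
    first(y) <- 42L   # works, but copies y each time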
Re: [Rd] [datatable-help] speeding up perception
Simon,

Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for :

i) convenience of new users who don't know how to vectorize yet
ii) more complex examples which can't be vectorized.

Before:
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After :
system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows :

system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131

Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ...

Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well).

I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand, since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.

Cheers,
Simon

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote:

thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality.

more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure.

If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion if there is an individual read/write into a data frame ("Warning: data frames are much slower than lists of vectors for individual element access").

I would also suggest changing the Introduction to R 6.3 from "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions." to "A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions. However, data frames can be much slower than matrices or even lists of vectors (which, like data frames, can contain different types of columns) when individual elements need to be accessed." Reading about it immediately upon introduction could flag the problem in a more visible manner.

regards,
/iaw
__
Re: [Rd] [datatable-help] speeding up perception
Simon (and all), I've tried to make assignment as fast as calling `[-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[-' from copying x? Small reproducible example in vanilla R 2.13.0 : x = list(a=1:1,b=1:1) class(x) = newclass [-.newclass = function(x,i,j,value) x # i.e. do nothing tracemem(x) [1] 0xa1ec758 x[1,2] = 42L tracemem[0xa1ec758 - 0xa1ec558]:# but, x is still copied, why? I've tried returning NULL from [-.newclass but then x gets assigned NULL : [-.newclass = function(x,i,j,value) NULL x[1,2] = 42L tracemem[0xa1ec558 - 0x9c5f318]: x NULL Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far). Matthew On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote: Simon, Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, which works for this case. For completeness, if I understand correctly, this is for : i) convenience of new users who don't know how to vectorize yet ii) more complex examples which can't be vectorized. Before: system.time(for (r in 1:R) DT[r,20] - 1.0) user system elapsed 12.792 0.488 13.340 After : system.time(for (r in 1:R) DT[r,20] - 1.0) user system elapsed 2.908 0.020 2.935 Where this can be reduced further as follows : system.time(for (r in 1:R) `[-.data.table`(DT,r,2,1.0)) user system elapsed 0.132 0.000 0.131 Still working on it. When it doesn't break other data.table tests, I'll commit to R-Forge ... Matthew On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote: Timothée, On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote: Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly. I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html) which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names; and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well). I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-though for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative. Cheers, Simon On Mon, Jul 4, 2011 at 6:19 AM, ivo welch ivo.we...@gmail.com wrote: thank you, simon. this was very interesting indeed. I also now understand how far out of my depth I am here. fortunately, as an end user, obviously, *I* now know how to avoid the problem. I particularly like the as.list() transformation and back to as.data.frame() to speed things up without loss of (much) functionality. 
more broadly, I view the avoidance of individual access through the use of apply and vector operations as a mixed IQ test and knowledge test (which I often fail). However, even for the most clever, there are also situations where the KISS programming principle makes explicit loops still preferable. Personally, I would have preferred it if R had, in its standard statistical data set data structure, foregone the row names feature in exchange for retaining fast direct access. R could have reserved its current implementation with row names but slow access for a less common (possibly pseudo-inheriting) data structure. If end users commonly do iterations over a data frame, which I would guess to be the case, then the impression of R by (novice) end users could be greatly enhanced if the extreme penalties could be eliminated or at least flagged. For example, I wonder if modest special internal code could store data frames internally and transparently as lists of vectors UNTIL a row name is assigned to. Easier and uglier, a simple but specific warning message could be issued with a suggestion
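For background on where that copy comes from: R evaluates a subassignment by rebinding the object to the temporary `*tmp*` and calling the replacement function on it (see the 'Subset assignment' section of the R Language Definition). A sketch of what x[1,2] <- 42L expands to; the extra binding is what bumps NAMED and so triggers the duplicate:

    `*tmp*` <- x
    x <- `[<-`(`*tmp*`, 1, 2, value = 42L)
    rm(`*tmp*`)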
[Rd] help.request() for packages?
Hi, Have I missed something, or misunderstood? The r-help posting guide asks users to contact the package maintainer :

If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. [snip] ONLY [only is bold font] send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports.

but the R-exts guide contains :

The mandatory ‘Maintainer’ field should give a single name with a valid (RFC 2822) email address in angle brackets (for sending bug reports etc.). It should not end in a period or comma. For a public package it should be a person, not a mailing list and not a corporate entity: do ensure that it is valid and will remain valid for the lifetime of the package.

Currently, data.table contains the datatable-help mailing list in the 'Maintainer' field, with the posting guide in mind (and service levels for users). This mailing list is where we would like users to ask questions about the package, not r-help, and not a single person. However, R-exts says that the 'Maintainer' email address should not be a mailing list. There seem to be two requirements:

i) a non-bouncing email address that CRAN maintainers can use - more like the 'Administrator' of the package
ii) a support address for users to send questions and bug reports

The BugReports field in DESCRIPTION is for bugs only, and allows only a URL, not an email address. bug.report() has a 'package' argument and emails the Maintainer field if the BugReports URL is not provided by the package. So, BugReports seems close, but not quite what we'd like. help.request() appears to have no 'package' argument (I checked R 2.13.0). Could a Support field (or better name) be added to DESCRIPTION, and a 'package' argument added to help.request() which uses it? Then the semantics of the Maintainer field can be closer to what the CRAN maintainers seem to think of it; i.e., the package 'Administrator'. Have I misunderstood or missed an option? Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
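For reference, the two existing fields being discussed can be inspected from R itself; a small sketch (the values returned depend on the installed version of the package, and BugReports may be absent):

    packageDescription("data.table", fields = c("Maintainer", "BugReports"))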
[Rd] method=radix in sort.list() isn't actually a radix sort
Dear list, Were you aware that, strictly speaking, do_radixsort in sort.c actually implements a counting sort, not a radix sort ? http://en.wikipedia.org/wiki/Counting_sort If it were a radix sort it wouldn't need the 100,000 range restriction. Clearly the method argument can't be changed (now) from "radix" to "counting", but perhaps a note could be added to the .Rd ? According to Wikipedia, Harold H. Seward created both counting and radix sorting in 1954, and they are distinctly different. I did a grep through all R source for the keyword "radix" in case this was already documented. A google search and rseek.org search didn't return results for "counting sort" in the R context. There appears to be scope to add (true) radix sorting to R then, without the 100,000 range restriction. Is there any interest in that? Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
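To make the distinction concrete: a counting sort allocates one bucket per possible value, so its memory grows with max(x) rather than length(x), which is exactly where a range restriction like 100,000 comes from. A minimal sketch in R, assuming positive integers:

    counting.sort <- function(x) {
        # one tabulate() pass fills the buckets; reconstruction is a rep()
        rep.int(seq_len(max(x)), tabulate(x, nbins = max(x)))
    }
    counting.sort(c(3L, 1L, 2L, 1L))   # 1 1 2 3

A radix sort, by contrast, makes several passes over fixed-width digits (say one byte at a time), so its memory does not depend on the range of the values.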
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
I don't know if that's enough to flip the UTF-8 switches internally in R. If it is enough, then this result may show I'm barking up the wrong tree. Hopefully someone from core is watching who knows. Is it feasible that you run R using an alias, and for some reason the alias is not picking up your shell variables? Best to rule that out now by running sessionInfo() at the R prompt. Otherwise, do you know profiling tools sufficiently to trace the problem at the C level as it runs on Windows? Matthew

Karl Ove Hufthammer k...@huftis.org wrote in message news:ihm9qq$9ej$1...@dough.gmane.org...
Matthew Dowle wrote:
I'm not sure, but note the difference in locale between Linux (UTF-8) and Windows (non UTF-8). As far as I understand it R much prefers UTF-8, which Windows doesn't natively support. Otherwise you could just change your Windows locale to a UTF-8 locale to make R happier.
[...]
If anybody knows a way to trick R on Linux into thinking it has an encoding similar to Windows then I may be able to take a look if I can reproduce the problem in Linux.

Changing the locale to an ISO 8859-1 locale, i.e.:

export LC_ALL=en_US.ISO-8859-1
export LANG=en_US.ISO-8859-1

I could *not* reproduce it; that is, 'table' is as fast on the non-ASCII factor as it is on the ASCII factor.
-- Karl Ove Hufthammer
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Thanks Simon! I can reproduce this on Linux now, too. locale -a didn't show en_US.iso88591 for me so I needed 'sudo locale-gen en_US' first. Then running R with

$ LANG=en_US.ISO-8859-1 R

is enough to reproduce the problem. Karl - can you use tabulate instead, as Simon suggests? Matthew

-- View this message in context: http://r.789695.n4.nabble.com/match-function-causing-bad-performance-when-using-table-function-on-factors-with-multibyte-characters-tp3229526p3237228.html Sent from the R devel mailing list archive at Nabble.com. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
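The tabulate() workaround works because a factor is already a vector of integer level codes, so no character translation is needed to count it; a sketch of a table-like count (variable names are illustrative, not Karl's code):

    f <- factor(sample(c("Æ", "Ø"), 1e5, replace = TRUE))
    counts <- tabulate(f, nbins = nlevels(f))   # counts by level code
    names(counts) <- levels(f)
    counts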
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
I'm not sure, but note the difference in locale between Linux (UTF-8) and Windows (non UTF-8). As far as I understand it R much prefers UTF-8, which Windows doesn't natively support. Otherwise you could just change your Windows locale to a UTF-8 locale to make R happier. My stab in the dark would be that the poor performance on Windows in this case may be down to many calls to translateCharUTF8 internally. There was a change in R 2.12.0 in this area. Running your test in R 2.11.1 on Windows shows the same problem though, so it doesn't look like that change caused this problem. From NEWS 2.12.0 :

o unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in 'unique.c'.

If anybody knows a way to trick R on Linux into thinking it has an encoding similar to Windows then I may be able to take a look if I can reproduce the problem in Linux. Matthew

Karl Ove Hufthammer k...@huftis.org wrote in message news:ihbko3$efs$1...@dough.gmane.org...
[I originally posted this on the R-help mailing list, and it was suggested that R-devel would be a better place to discuss it.]

Running 'table' on a factor with levels containing non-ASCII characters seems to result in extremely bad performance on Windows. Here's a simple example with benchmark results (I've reduced the number of replications to make the function finish within reasonable time):

library(rbenchmark)
x.num=sample(1:2, 10^5, replace=TRUE)
x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
x.fac.nascii=factor(x.num, levels=1:2, labels=c("Æ","Ø"))
benchmark(
  table(x.num),
  table(x.fac.ascii),
  table(x.fac.nascii),
  table(unclass(x.fac.nascii)),
  replications=20
)

                          test replications elapsed   relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
2           table(x.fac.ascii)           20    0.33       1.00      0.33     0.00         NA        NA
3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA

sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 LC_CTYPE=Norwegian-Nynorsk_Norway.1252 LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C LC_TIME=Norwegian-Nynorsk_Norway.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results on the latest prerelease (R 2.13.0 2011-01-18 r54032).
Running the same test (100 replications) on a Linux system with R 2.12.1 Patched results in essentially no difference between the performance on ASCII factors and non-ASCII factors:

                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
2           table(x.fac.ascii)          100   1.488     1.00     1.459    0.028          0         0
3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

sessionInfo()
R version 2.12.1 Patched (2011-01-18 r54033)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=nn_NO.UTF-8 LC_NUMERIC=C LC_TIME=nn_NO.UTF-8
[4] LC_COLLATE=nn_NO.UTF-8 LC_MONETARY=C LC_MESSAGES=nn_NO.UTF-8
[7] LC_PAPER=nn_NO.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_0.3

Profiling the 'table' function indicates almost all the time is spent in the 'match' function, which is used when 'factor' is called on a 'factor' inside 'table'. Indeed, 'x.fac.nascii = factor(x.fac.nascii)' by itself is extremely slow. Is there any theoretical reason 'factor' on 'factor' with non-ASCII characters must be so slow? And why doesn't this happen on Linux? Perhaps a fix for 'table' might be calculating the 'table' statistics *including* all levels (not using the 'factor' function anywhere), and then removing the 'exclude' levels in the end. For example, something along these lines:

res = table.modified.to.not.use.factor(...)
ind = lapply(dimnames(res), function(x) !(x %in% exclude))
do.call("[", c(list(res), ind, drop=FALSE))

(I haven't tested this very much, so there may
Re: [Rd] reliability of R-Forge? (moving to r-Devel)
Spencer and David, My experience of R-Forge :

i) SVN access and project management web pages have been *very* reliable all this year ... up until the weekend. This week was the first time I ever saw "R-Forge Could Not Connect to Database".

ii) The nightly build and checks have been consistently unreliable all year. At best the nightly build is a few days behind the latest commit, but they are working on it. This isn't as critical as (i) though, since users can install from source: install.packages(pkg, type="source", repos="http://R-Forge.R-project.org").

iii) Mailing lists have been down since the weekend and I too have been mailing r-fo...@r-project.org with no response. That is *very* unusual; first time.

Hope that helps to put it into context at least. Matthew

P.S. I notice that R-Forge appears to be back up now, including the mailing lists.

Spencer Graves spencer.gra...@structuremonitoring.com wrote in message news:4c762b50.7000...@structuremonitoring.com...
Hello: Can anyone comment on plans for R-Forge? Please see thread below. Ramsay, Hooker and I would like to release a new version of fda to CRAN. We committed changes for it last Friday. I'd like to see reports of their daily checks, then submit to CRAN from R-Forge. Unfortunately, it seems to be down now, saying "R-Forge Could Not Connect to Database:". I just tried 'install.packages("fda", repos="http://R-Forge.R-project.org")', and got the previous version, which indicates that my changes from last Friday have not been built yet. Also, a few days ago, I got an error from 'install.packages("pfda", repos="http://R-Forge.R-project.org")' (a different package, 'pfda' NOT 'fda'). I don't remember the error message, but this same command worked for me just now. I infer from this that I should consider submitting the latest version of 'fda' to CRAN manually, not waiting for the R-Forge [formerly] daily builds and checks. R-Forge is an incredibly valuable resource. It would be even more valuable if it were more reliable. I very much appreciate the work of the volunteers who maintain it; I am unfortunately not in a position to volunteer to do more for the R-Project generally and R-Forge in particular than I already do. Thanks, Spencer Graves

On 8/26/2010 1:07 AM, Jari Oksanen wrote:
David Kane (dave at kanecap.com) writes:
How reliable is R-Forge? http://r-forge.r-project.org/ It is down now (for me). Reporting "R-Forge Could Not Connect to Database:" I have just started to use it for a project. It has been down for several hours (at least) on different occasions over the last couple of days. Is that common? Will it be more stable soon? Apologies if this is not an appropriate question for R-help.
Dave, This is rather a subject for R-devel. Ignoring this inappropriateness: yes, indeed, R-Forge has been flaky lately. The database was disconnected for the whole weekend, came back on Monday, and is gone again. It seems that mailing lists and email alerts of commits were not working even when the basic R-Forge was up. I have sent two messages to r-fo...@r-project.org on these problems. I haven't got a response, but soon after the first message the Forge woke up, and soon after the second message it went down. Since I'm not Bayesian, I don't know what to say about the effect of my messages. Cheers, Jari Oksanen
__ r-h...@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Spencer Graves, PE, PhD President and Chief Operating Officer Structure Inspection and Monitoring, Inc. 751 Emerson Ct. San José, CA 95126 ph: 408-655-4567 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Non-blocking Eval
There is a video demo of exactly that on the data.table homepage :
http://datatable.r-forge.r-project.org/
http://www.youtube.com/watch?v=rvT8XThGA8o

However, last time I looked, svSocket uses text transfer. It would be really great if it did binary serialization, like Rserve does. Previous threads :
http://r.789695.n4.nabble.com/Using-svSocket-with-data-table-tp924554p924554.html
http://r.789695.n4.nabble.com/Video-demo-of-using-svSocket-with-data-table-tp893671p893672.html
This one contains a comparison of Rserve and svSocket :
http://r.789695.n4.nabble.com/Fwd-Re-Video-demo-of-using-svSocket-with-data-table-tp903723p903723.html
Best, Matthew

Philippe Grosjean phgrosj...@sciviews.org wrote in message news:4c629ab7.60...@sciviews.org...
Hello, For non-blocking access to R through sockets, you should also look at svSocket. It may be more appropriate than Rserve for feeding data to R, while you have another process running in R that does something like updating a graph, or some other calculations. Best, Philippe Grosjean

On 20/07/10 14:10, Martin Kerr wrote:
Sorry I phrased that badly. What I'm trying to do is asynchronously add data to R, i.e. a program will periodically dump some readings to the R server and then later on another program will run some analysis scripts on them. I have managed to add the data via CMD_detachedVoidEval as you suggested. How exactly do I go about attaching to the session again? I know it involves some form of session key that comes back from the detach call, but what form does it take? And how do I use this? Thanks again, Martin

Subject: Re: [Rd] Non-blocking Eval From: simon.urba...@r-project.org Date: Mon, 19 Jul 2010 11:34:29 -0400 CC: r-devel@r-project.org To: mk2...@hotmail.com
On Jul 19, 2010, at 10:58 AM, Martin Kerr wrote:
Hello, I'm currently working with the C++ version of the Rserve Client as part of a student project. Is there an implementation of a non-blocking interface to Rserve in C++? I can find one via the Java JRI but no equivalent in C++.
(Please note that stats-rosuda-devel is the correct list for this.) I'm not quite sure what you mean, because in JRI there is idleEval() which is non-blocking in the sense that it doesn't do anything if R is busy, but that doesn't apply to Rserve as by definition R cannot be busy there. There is no non-blocking interface to JRI - all calls are synchronous. If your question is whether you can start an evaluation in Rserve and not wait for the result then there is CMD_detachedVoidEval in Rserve, but the C++ client only implements a subset of the API which does not include that -- however, it is trivial to implement (just send a request with CMD_detachedVoidEval as there is nothing to decode). Cheers, Simon
[[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
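On the binary transfer point: serialize() already produces a raw vector that can be written straight to a binary socket connection, which is the building block a binary-mode svSocket would need; a minimal sketch (the host, port and listening peer are assumptions for illustration, not svSocket's API):

    con <- socketConnection("localhost", port = 8888, open = "wb", blocking = TRUE)
    bytes <- serialize(mtcars, connection = NULL)   # raw vector encoding of the object
    writeBin(bytes, con)
    close(con)

The receiving side would readBin() the bytes and unserialize() them; see the Warnings section of ?serialize about cross-version use.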
Re: [Rd] suggestion how to use memcpy in duplicate.c
Is this a thumbs up for memcpy for DUPLICATE_ATOMIC_VECTOR at least ? If there is further specific testing then let me know, happy to help, but you seem to have beaten me to it. Matthew

Simon Urbanek simon.urba...@r-project.org wrote in message news:65d21b93-a737-4a94-bdf4-ad7e90518...@r-project.org...
On Apr 21, 2010, at 2:15 PM, Seth Falcon wrote:
On 4/21/10 10:45 AM, Simon Urbanek wrote:
Won't that miss the last incomplete chunk? (and please don't use DATAPTR on INTSXP even though the effect is currently the same) In general it seems that it depends on nt whether this is efficient or not, since calls to short memcpy are expensive (very small nt that is). I ran some empirical tests to compare memcpy vs for() (x86_64, OS X) and the results were encouraging - depending on the size of the copied block the difference could be quite big:
- tiny block (ca. n = 32 or less): for() is faster
- small block (n ~ 1k): memcpy is ca. 8x faster
- as the size increases the gap closes (presumably due to RAM bandwidth limitations), so for n = 512M it is ~30%
Of course this is contingent on the implementation of memcpy, compiler, architecture etc. And will only matter if copying is what you do most of the time ...
Copying of vectors is something that I would expect to happen fairly often in many applications of R. Is for() faster on small blocks by enough that one would want to branch based on size?
Good question. Given that the branching itself adds overhead, possibly not. In the best case for() can be ~40% faster (for single-digit n) but that means billions of copies to make a difference (since the operation itself is so fast). The break-even point on my test machine is n=32 and when I added the branching it took a 20% hit, so I guess it's simply not worth it. The only case that may be worth branching is n:1 since that is likely a fairly common use (the branching penalty in copy routines is lower than comparing memcpy/for implementations since the branching can be done before the outer for loop, so this may vary case-by-case). Cheers, Simon
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion how to use memcpy in duplicate.c
Just to add some clarification, the suggestion wasn't motivated by speeding up a length 3 vector being recycled 3.3 million times. But it's a good point that any change should not make that case slower. I don't know how much vectorCopy is called really; DUPLICATE_ATOMIC_VECTOR seems more significant, which doesn't recycle, and already had the FIXME next to it. Where copyVector is passed a large source though, then memcpy should be faster than any of the methods using a for loop through each element (whether recycling or not), allowing for the usual caveats. What are the timings like if you repeat the for loop 100 times to get a more robust timing ? It needs to be a repeat around the for loop only, not the allocVector whose variance looks to be included in those timings below. Then increase the size of the source vector, and compare to memcpy. Matthew

William Dunlap wdun...@tibco.com wrote in message news:77eb52c6dd32ba4d87471dcd70c8d70002ce6...@na-pa-vbe03.na.tibco.com...
If I were worried about the time this loop takes, I would avoid using i%nt. For the attached C code compiled with gcc 4.3.3 with -O2 I get

# INTEGER() in loop
> system.time( r1 <- .Call("my_rep1", 1:3, 1e7) )
   user  system elapsed
  0.060   0.012   0.071
# INTEGER() before loop
> system.time( r2 <- .Call("my_rep2", 1:3, 1e7) )
   user  system elapsed
  0.076   0.008   0.086
# replace i%src_length in loop with j=0 before loop and
# if(++j==src_length) j=0 ; in the loop.
> system.time( r3 <- .Call("my_rep3", 1:3, 1e7) )
   user  system elapsed
  0.024   0.028   0.050
> identical(r1,r2)
[1] TRUE
> identical(r2,r3)
[1] TRUE

The C code is:

#define USE_RINTERNALS /* pretend we are in the R kernel */
#include <R.h>
#include <Rinternals.h>

SEXP my_rep1(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int i,j ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    for(i=0;i<dest_length;i++) {
        INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep2(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    /* end of boilerplate */
    for(i=0;i<dest_length;i++) {
        pdest[i] = psrc[i % src_length] ;
    }
    UNPROTECT(1) ;
    return s_dest ;
}

SEXP my_rep3(SEXP s_src, SEXP s_dest_length)
{
    int src_length = length(s_src) ;
    int dest_length = asInteger(s_dest_length) ;
    int *psrc = INTEGER(s_src) ;
    int *pdest ;
    int i,j ;
    SEXP s_dest ;
    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
    pdest = INTEGER(s_dest) ;
    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
    /* end of boilerplate */
    for(j=0,i=0;i<dest_length;i++) {
        *pdest++ = psrc[j++] ;
        if (j==src_length) { j = 0 ; }
    }
    UNPROTECT(1) ;
    return s_dest ;
}

Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com

-Original Message- From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Romain Francois Sent: Wednesday, April 21, 2010 12:32 PM To: Matthew Dowle Cc: r-de...@stat.math.ethz.ch Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c

Le 21/04/10 17:54, Matthew Dowle a écrit :
From copyVector in duplicate.c :

void copyVector(SEXP s, SEXP t)
{
    int i, ns, nt;
    nt = LENGTH(t);
    ns = LENGTH(s);
    switch (TYPEOF(s)) {
    ...
    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;
    ...

could that be replaced with :

    case INTSXP:
        for (i=0; i<ns/nt; i++)
            memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

or at least with something like this:

    int* p_s = INTEGER(s) ;
    int* p_t = INTEGER(t) ;
    for( i=0 ; i<ns ; i++){
        p_s[i] = p_t[i % nt];
    }

since expanding the INTEGER macro over and over has a price. and similar for the other types in copyVector. This won't help regular vector copies, since those seem to be done by the DUPLICATE_ATOMIC_VECTOR macro, see next suggestion below, but it should help copyMatrix which calls copyVector, scan.c which calls copyVector on three lines, dcf.c (once) and dounzip.c (once). For the DUPLICATE_ATOMIC_VECTOR macro there is already a comment next to it : FIXME: surely memcpy would be faster here? which seems to refer to the for loop :

    else { \
        int __i__; \
        type *__fp__ = fun(from), *__tp__ = fun(to); \
        for (__i__
[Rd] suggestion how to use memcpy in duplicate.c
From copyVector in duplicate.c :

void copyVector(SEXP s, SEXP t)
{
    int i, ns, nt;
    nt = LENGTH(t);
    ns = LENGTH(s);
    switch (TYPEOF(s)) {
    ...
    case INTSXP:
        for (i = 0; i < ns; i++)
            INTEGER(s)[i] = INTEGER(t)[i % nt];
        break;
    ...

could that be replaced with :

    case INTSXP:
        for (i=0; i<ns/nt; i++)
            memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char *)DATAPTR(t), nt*sizeof(int));
        break;

and similar for the other types in copyVector. This won't help regular vector copies, since those seem to be done by the DUPLICATE_ATOMIC_VECTOR macro, see next suggestion below, but it should help copyMatrix which calls copyVector, scan.c which calls copyVector on three lines, dcf.c (once) and dounzip.c (once).

For the DUPLICATE_ATOMIC_VECTOR macro there is already a comment next to it : FIXME: surely memcpy would be faster here? which seems to refer to the for loop :

    else { \
        int __i__; \
        type *__fp__ = fun(from), *__tp__ = fun(to); \
        for (__i__ = 0; __i__ < __n__; __i__++) \
            __tp__[__i__] = __fp__[__i__]; \
    } \

Could that loop be replaced by the following ?

    else { \
        memcpy((char *)DATAPTR(to), (char *)DATAPTR(from), __n__*sizeof(type)); \
    } \

In the data.table package, dogroups.c uses this technique, so the principle is tested and works well so far. Are there any road blocks preventing this change, or is anyone already working on it ? If not then I'll try and test it (on Ubuntu 32bit) and submit a patch with timings, as before. Comments/pointers much appreciated. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Suggestion to add crantastic to resources section on posting guide
Under the further resources section I'd like to suggest the following addition :

* http://crantastic.org/ lists popular packages according to other users' votes. Consider briefly reviewing the top 30 packages before posting to r-help since someone may have already released a package that solves your problem.

That's just a straw-man idea, so I hope there will be an answer, or discussion, either way. Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] shash in unique.c
I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No response resulted in a retry() after a Sys.sleep(10 days). If it's a yes or maybe then I could proceed to try it, test it, and present the test results and timings to you along with the patch. It would be on 32bit Ubuntu first, and I would need to either buy, rent time on, or borrow a 64bit machine to be able to then test there, owing to the nature of the suggestion. If it's a no, 'bad idea because ...', 'we were already working on it', or better, then I won't spend any more time on it. Matthew

Matthew Dowle mdo...@mdowle.plus.com wrote in message news:hlu4qh$l7...@dough.gmane.org...
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer? shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] shash in unique.c
Thanks a lot. Quick and brief responses below...

Duncan Murdoch murd...@stats.uwo.ca wrote in message news:4b90f134.6070...@stats.uwo.ca...
Matthew Dowle wrote:
I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No response resulted in a retry() after a Sys.sleep(10 days). If it's a yes or maybe then I could proceed to try it, test it, and present the test results and timings to you along with the patch. It would be on 32bit Ubuntu first, and I would need to either buy, rent time on, or borrow a 64bit machine to be able to then test there, owing to the nature of the suggestion. If it's a no, 'bad idea because ...', 'we were already working on it', or better, then I won't spend any more time on it. Matthew

Matthew Dowle mdo...@mdowle.plus.com wrote in message news:hlu4qh$l7...@dough.gmane.org...
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer?

Two negative but probably not fatal issues: Pointers and ints are not always the same size. In Win64, ints are 32 bits, pointers are 64 bits. (Can we be sure there is some integer type the same size as a pointer? I don't know, ask a C expert.)

No we can't be sure. But we could test at runtime, and if the assumption wasn't true, then revert to the existing method.

We might want to save the hash to disk. On restore, the pointer based hash would be all wrong. (I don't know if we actually do ever save a hash to disk.)

The hash table in unique.c appears to be a temporary private hash, different to the global R_StringHash. Its private hash appears to be used only while the call to unique runs, then free'd. That's my understanding anyway. The suggestion is not to alter the global R_StringHash in any way at all, which is the one that might be saved to disk now or in the future.

Duncan Murdoch

shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
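On the pointer-size point, the relevant sizes are already visible from R itself, so the runtime check mentioned above is straightforward; a small sketch (values are per-build):

    .Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on 32-bit
    .Machine$sizeof.long      # size of a C long, for comparison

If sizeof.pointer is larger than an unsigned int, the proposed cast would discard the high bits of the address, which is where a conditional fallback to the existing string hash would apply.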
Re: [Rd] Suggestion to add crantastic to resources section on posting guide
That appears to be an epistemic error. Some people, and I would agree it seems like an increasing number of people, clearly don't read the posting guide. However, it is impossible for anyone to know how many people do read it, do thoroughly read it and, therefore, don't ever need to post to r-help. Those people would be missing from the statistical sample of people who do post. In fact it would be very surprising indeed, assuming it is true that R is getting more popular, not to see the number of non-compliant posters increase. I don't believe in basing decisions upon poorly applied statistics, especially ones that go from correlation to causation so casually.

Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1003050433i7f104bd4l1e1421fab0d3...@mail.gmail.com...
I don't think we should be expanding the posting guide. It's already so long that no one reads it. We should be thinking of ways to cut it down to a smaller size instead.
On Fri, Mar 5, 2010 at 5:52 AM, Matthew Dowle mdo...@mdowle.plus.com wrote:
Under the further resources section I'd like to suggest the following addition :
* http://crantastic.org/ lists popular packages according to other users' votes. Consider briefly reviewing the top 30 packages before posting to r-help since someone may have already released a package that solves your problem.
That's just a straw-man idea, so I hope there will be an answer, or discussion, either way. Matthew
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] shash in unique.c
Looking at shash in unique.c, from R-2.10.1, I'm wondering if it makes sense to hash the pointer itself rather than the string it points to? In other words could the SEXP pointer be cast to unsigned int and the usual scatter be called on that as if it were integer? shash would look like a slightly modified version of ihash like this :

static int shash(SEXP x, int indx, HashData *d)
{
    if (STRING_ELT(x, indx) == NA_STRING) return 0;
    return scatter((unsigned int) STRING_ELT(x, indx), d);
}

rather than its current form which appears to hash the string it points to :

static int shash(SEXP x, int indx, HashData *d)
{
    unsigned int k;
    const char *p;
    if(d->useUTF8)
        p = translateCharUTF8(STRING_ELT(x, indx));
    else
        p = translateChar(STRING_ELT(x, indx));
    k = 0;
    while (*p++)
        k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
    return scatter(k, d);
}

Looking at sequal, below, and reading its comments, if the pointers are equal it doesn't look at the strings they point to, which led to the question above.

static int sequal(SEXP x, int i, SEXP y, int j)
{
    if (i < 0 || j < 0) return 0;
    /* Two strings which have the same address must be the same,
       so avoid looking at the contents */
    if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
    /* Then if either is NA the other cannot be */
    /* Once all CHARSXPs are cached, Seql will handle this */
    if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
        return 0;
    return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
}

Matthew __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Why is there no c.factor?
> concat() doesn't get a lot of use

How do you know? Maybe it's used a lot but the users had no need to tell you what they were using. The exact opposite might in fact be the case, i.e. because concat is so good in splus, you just never hear of problems with it from the users. That might be a very good sign.

> perhaps that model would work well for a concatenation function in R

I'd be happy to test it. I'm a bit concerned about performance though, given what you said about repeated recursive calls, and dispatch. Could you run the following test in s-plus please and post back the timing? If this small 100MB example was fine, then we could proceed to a 64bit 10GB test. This is quite nippy at the moment in R (1.1 sec). I'd be happy with a better way as long as speed wasn't compromised.

set.seed(1)
L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))   # union set of 676 levels
F = lapply(1:100, function(i) {
    # create 100 factors
    f = sample(1:100, 1*1024^2 / 4, replace=TRUE)   # each factor 1MB large (262144 integers), plus a small amount for the levels
    levels(f) = sample(L,100)   # pick 100 levels from the union set
    class(f) = "factor"
    f
})

> head(F[[1]])
[1] RT DM CO JV BG KU
100 Levels: YC FO PN IL CB CY HQ ...
> head(F[[2]])
[1] RK PD FE SG SJ CQ
100 Levels: JV FV DX NL XB ND CY QQ ...

With c.factor from data.table, as posted, placed in .GlobalEnv :

> system.time(G <- do.call(c,F))
   user  system elapsed
   0.81    0.32    1.12
> head(G)
[1] RT DM CO JV BG KU   # looks right, comparing to F[[1]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> G[262145:262150]
[1] RK PD FE SG SJ CQ   # looks right, comparing to F[[2]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> identical(as.character(G),as.character(unlist(F)))
[1] TRUE

So I guess this would be compared to the following in splus ?

system.time(G <- do.call(concat, F))

or maybe it's just the following :

system.time(G <- concat(F))

I don't have splus so I can't test that myself.

William Dunlap wdun...@tibco.com wrote in message news:77eb52c6dd32ba4d87471dcd70c8d7000275b...@na-pa-vbe03.na.tibco.com...
-Original Message- From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Peter Dalgaard Sent: Friday, February 05, 2010 7:41 AM To: Hadley Wickham Cc: John Fox; r-devel@r-project.org; Thomas Lumley Subject: Re: [Rd] Why is there no c.factor?

Hadley Wickham wrote:
On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham had...@rice.edu wrote:
I'd propose the following: If the sets of levels of all arguments are the same, then c.factor() would return a factor with the common set of levels; if the sets of levels differ, then, as Hadley suggests, the level-set of the result would be the union of the sets of levels of the arguments, but a warning would be issued.
I like this compromise (as long as there was an argument to suppress the warning). If I provided code to do this, along with the warnings for ordered factors and using the optimisation suggested by Matthew, is there any member of R core who would be interested in sponsoring it? Hadley

Messing with c() is a bit unattractive (I'm not too happy with the other c methods either; normally c() strips attributes and reduces to the base class, and those obviously do not), but a more general concat() function has been suggested a number of times. With a suitable range of methods, this could also be used to reimplement rbind.data.frame (which, incidentally, already contains a method for concatenating factors, with several ugly warts!)

Yes, c() should have been put on the deprecated list a couple of decades ago, since people expect it to do too many incompatible things. And factor should have been a virtual class, with subclasses FixedLevels (e.g., Sex) or AdHocLevels (e.g., FamilyName), so c() and [<-() could do the appropriate thing in either case. Back to reality, S+ has a concat(...) function, whose comments say

# This function works like c() except that names of arguments are
# ignored. That is, it concatenates its arguments into a single
# S vector object, without considering the names of the arguments,
# in the order that the arguments are given.
#
# To make this function work for new classes, it is only necessary
# to make methods for the concat.two function, which concatenates
# two vectors; recursion will take care of the rest.

concat() is not generic but it repeatedly calls concat.two(x,y), an SV4-generic that dispatches on the classes of x and y. Thus you can easily predict the class of concat(x,y,z), although it may not be the same as the class of concat(z,y,x), given suitably bizarre methods for concat.two(). concat() doesn't get a lot of use but I think the idea is sound. Perhaps
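Bill's pairwise design is easy to mimic in R for anyone wanting to experiment; a hypothetical sketch (the names concat/concat.two are taken from his description; plain S3 dispatches on the first argument only, unlike the SV4 double dispatch he describes):

    concat <- function(...) Reduce(concat.two, list(...))
    concat.two <- function(x, y) UseMethod("concat.two")
    concat.two.default <- function(x, y) c(x, y)
    concat.two.factor <- function(x, y) {
        # union the levels, then map both sets of codes onto them
        lev <- union(levels(x), levels(y))
        factor(c(as.character(x), as.character(y)), levels = lev)
    }
    concat(factor("a"), factor("b"), factor(c("c","b","a")))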
Re: [Rd] Why is there no c.factor?
A search for c.factor returns tons of hits on this topic. Here's just one of the hits from 2006, when I asked the same question : http://tolstoy.newcastle.edu.au/R/e2/devel/06/11/1137.html So it appears to be complicated and there are good reasons. Since I needed it, I created c.factor in the data.table package, below. It does it more efficiently since it doesn't convert each factor to character (hence losing some of the benefit). I've been told I'm not unique in this approach and that other packages also have their own c.factor. It deliberately isn't exported. It's worked well for me over the years anyway.

c.factor = function(...) {
    args <- list(...)
    for (i in seq(along=args))
        if (!is.factor(args[[i]])) args[[i]] = as.factor(args[[i]])
    # The first must be factor otherwise we wouldn't be inside c.factor; it's checked anyway in the line above.
    newlevels = sort(unique(unlist(lapply(args, levels))))
    ans = unlist(lapply(args, function(x) {
        m = match(levels(x), newlevels)
        m[as.integer(x)]
    }))
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}

Hadley Wickham had...@rice.edu wrote in message news:f8e6ff051002040753x33282f33l78fce9f98dc29...@mail.gmail.com...
Hi all, Is there a reason that there is no c.factor method? Analogous to c.Date, I'd expect something like the following to be useful:

c.factor <- function(...) {
    factors <- list(...)
    levels <- unique(unlist(lapply(factors, levels)))
    char <- unlist(lapply(factors, as.character))
    factor(char, levels = levels)
}

c(factor("a"), factor("b"), factor(c("c", "b", "a")), factor("d"))
# [1] a b c b a d
# Levels: a b c d

Hadley -- http://had.co.nz/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] wiki down?
I see the same problem. The wiki link on the R homepage doesn't seem to respond. A search of r-devel for subjects containing wiki finds this seemingly unanswered recent post. Is it known? -Matthew

Ben Bolker bol...@ufl.edu wrote in message news:4b44b12a.60...@ufl.edu...
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] split.data.frame
This seems very similar to the data.table package. The 'by' argument splits the data.table by that value, then executes the j expression within each subset. The package documentation talks about 'subset' and 'with' in some detail. See ?[.data.table.

dt = data.table(x=1:20, y=rep(1:4,each=5))
dt[,sum(x),by=y]

> and x has a variable called grp, what do you get?

In data.table that choice is given to the user via the argument 'with', which by default is TRUE, meaning you get the x inside dt.

Romain Francois romain.franc...@dbmail.com wrote in message news:4b288645.3010...@dbmail.com...
On 12/16/2009 12:14 AM, Peter Dalgaard wrote:
Romain Francois wrote:
Hello, I very much enjoy with and subset semantics for data frames and was wondering if we could have something similar with split, basically by evaluating the second argument with the data frame :
I seem to recall that this idea was considered and rejected when the current split.data.frame was written (10 years ago!). The main reasons were that
- it's not really THAT hard to evaluate a single splitting expression using with() or eval()
Sure, this is just about convenience and laziness.
- not all applications will have the splitting factor inside the df to split ( split(df[-1], df[[1]]) for a simple case)
this still works
- if you need a computed splitting factor, there's a risk of inadvertent variable capture. I.e., if you inside a function do
grp <- ...whatever...
spl <- split(x, grp)
and x has a variable called grp, what do you get?
this is a problem indeed. thanks for the reply. Romain
-- Romain Francois Professional R Enthusiast +33(0) 6 28 91 30 30 http://romainfrancois.blog.free.fr |- http://tr.im/HlX9 : new package : bibtex |- http://tr.im/Gq7i : ohloh `- http://tr.im/FtUu : new package : highlight
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
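For comparison with the dt example above, the base-R spelling of that grouped sum goes through split(), the function under discussion; a minimal sketch:

    sapply(split(dt$x, dt$y), sum)   # one sum per group in y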
Re: [Rd] Using svSocket with data.table
Hi Olaf, Thanks for your feedback, much appreciated.

> Don't be fooled. R does not handle multiple requests in parallel internally.

I wasn't fooled, but I've added some annotations to the video at the place I might have given the impression I was (at 4min 39sec). Later, at 5min 30sec, I did already point out that the graph stopped while the R server processed this client's request, but that is later. http://www.youtube.com/watch?v=rvT8XThGA8o

> Also I suspect that, depending on what you do on the CLI, this will interact badly with svSocket.

Can you give an example to try out? Regards, Matthew

Olaf Mersmann ol...@kimberly.tako.de wrote in message news:1248555172-sup-4...@bloxx.local...
Hi Matthew,
Excerpts from Matthew Dowle's message of Sat Jul 25 09:07:44 +0200 2009:
So I'm looking to do the same as the demo, but with a binary socket. Does anyone have any ideas? I've looked a bit at Rserve, bigmemory, biocep, nws but although all those packages are great, I didn't find anything that worked in exactly this way, i.e. i) R to R, ii) CLI non-blocking and iii) no need to start up R in a special way.
Don't be fooled. R does not handle multiple requests in parallel internally. Also I suspect that, depending on what you do on the CLI, this will interact badly with svSocket. As far as binary transfer of R objects goes, you are probably looking for serialize() and unserialize(). Not sure if these are guaranteed to work across different versions of R and different word sizes. See the Warnings section in the serialize manual page. Cheers Olaf
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] merge performace degradation in 2.9.1
Is there a way to avoid the degradation in performance in 2.9.1? If the example is to demonstrate a difference between R versions that you really need to get to the bottom of, then read no further. However, if the example is actually what you want to do, then you can speed it up by using a data.table as follows, reducing the 26 secs to 1 sec. Time on my PC at home (quite old now!) :

> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   user  system elapsed
  25.63    0.58   26.98

Using a data.table instead :

X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N), key="mon")
Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
> tables()
     NAME      NROW COLS       KEY
[1,] X    1,200,000 group,mon  mon
[2,] Y           12 mon,letter mon
> system.time(X$letter <- Y[X,letter])   # Y[X] is the syntax for merge of two data.tables
   user  system elapsed
   0.98    0.11    1.10
> identical(Out$letter, X$letter)
[1] TRUE
> identical(Out$mon, X$mon)
[1] TRUE
> identical(Out$group, X$group)
[1] TRUE

To do the multi-column equi-join of X and Z, set a key of 2 columns. 'nomatch' is the equivalent of 'all' and can be set to 0 (inner join) or NA (outer join).

Adrian Dragulescu adria...@eskimo.com wrote in message news:pine.lnx.4.64.0907090953580.1...@shell.eskimo.com...
I have noticed a significant performance degradation using merge in 2.9.1 relative to 2.8.1. Here is what I observed:

N <- 10
X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Z <- cbind(Y, group=1:12)
system.time(Out <- merge(X, Y, by="mon", all=TRUE))
# R 2.8.1 is 17% faster than R 2.9.1 for N=10
system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
# R 2.8.1 is 16% faster than R 2.9.1 for N=10

Here is the head of summaryRprof() for 2.8.1

$by.self
                   self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2

and for 2.9.1

$by.self
                   self.time self.pct total.time total.pct
sort.list               4.66     39.2       4.66      39.2
nchar                   3.28     27.6       3.28      27.6
make.unique             1.42     12.0       1.92      16.2
as.character            0.50      4.2       0.50       4.2
data.frame              0.46      3.9       4.12      34.7
[.data.frame            0.44      3.7       7.28      61.3

As you notice, 2.9.1 has an nchar entry that is quite time consuming. Is there a way to avoid the degradation in performance in 2.9.1? Thank you, Adrian

As an aside, I got interested in testing merge in 2.9.1 by reading the r-devel message from 30-May-2009 "Degraded performance with rank()" by Tim Bergsma, as he mentions doing merges, but only today decided to test.
__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
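A sketch of the two-column keyed join mentioned above, written in current data.table syntax (the release current at the time of this thread may have wanted the key as a single comma-separated string, so treat this as illustrative; X2/Z2 are hypothetical copies to avoid clobbering the objects above):

    X2 <- data.table(X, key = c("mon", "group"))
    Z2 <- data.table(Z, key = c("mon", "group"))
    Z2[X2, nomatch = NA]   # keeps all rows of X2; nomatch = 0 would drop non-matches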
Re: [Rd] Can we generate exe file using R? What is the maximum file size valid?
Does Ra get close to compiled R ? The R code is compiled on the fly to bytecode which is executed internally by an interpreter in C. The timing tests look impressive. http://www.milbo.users.sonic.net/ra/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
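For context, stock R later gained a related facility when the byte-code compiler shipped as the 'compiler' base package in R 2.13.0; a minimal sketch of compiling a function (timings are machine-dependent):

    library(compiler)
    f <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
    fc <- cmpfun(f)   # byte-compiled copy of f
    system.time(f(1e6))
    system.time(fc(1e6))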