Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Matthew Dowle
Stavros Macrakis macrakis at alum.mit.edu writes: data.table certainly has some useful mechanisms, and I've been experimenting with it as an implementation mechanism, though it's not a drop-in substitute for factors. Also, though it is efficient for set operations between small sets and

Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Milan Bouchet-Valat
On Sunday, 6 November 2011 at 19:00 -0500, Stavros Macrakis wrote: Milan, Jeff, Patrick, Thank you for your comments and suggestions. Milan, This is far from a completely theoretical problem. I am performing text analytics on a corpus of about 2m documents. There are tens of

Re: [Rd] Efficiency of factor objects

2011-11-07 Thread Stavros Macrakis
Matthew, Yes, the case I am thinking of is a 1-column key; sorry for the overgeneralization. I haven't thought much about the multi-column key case. -s On Mon, Nov 7, 2011 at 12:48, Matthew Dowle mdo...@mdowle.plus.com wrote: Stavros Macrakis macrakis at alum.mit.edu writes:

[Rd] Efficiency of factor objects

2011-11-06 Thread Stavros Macrakis
Milan, Jeff, Patrick, Thank you for your comments and suggestions. Milan, This is far from a completely theoretical problem. I am performing text analytics on a corpus of about 2m documents. There are tens of thousands of distinct words (lemmata). It seems to me that the natural

Re: [Rd] Efficiency of factor objects

2011-11-05 Thread Patrick Burns
Perhaps 'data.table' would be a package on CRAN that would be acceptable. On 05/11/2011 16:45, Jeffrey Ryan wrote: Or better still, extend R via the mechanisms in place. Something akin to a fast factor package. Any change to R causes downstream issues in (hundreds of?) millions of lines of

[Rd] Efficiency of factor objects

2011-11-04 Thread Stavros Macrakis
R factors are the natural way to represent factors -- and should be efficient since they use small integers. But in fact, for many (but not all) operations, R factors are considerably slower than integers, or even character strings. This appears to be because whenever a factor vector is