Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-03 Thread Jeff Ryan
A bit late and possibly tangential. 

The mmap package has something called struct(), which is really a row-wise array
of heterogeneous columns.

As Simon and others have pointed out, R has no way to handle this natively, but
mmap provides a very measurable performance gain by laying rows out together
in memory (mapped memory, to be specific). Since it all lives outside of R, so
to speak, mmap even supports many non-native types, from bit vectors to 64-bit
ints, with the applicable conversion caveats.

example(struct) shows some performance gains with this approach. 

There are even some crude methods to convert as-is data.frames to mmap struct
objects directly (hint: as.mmap).

Again, likely not enough to shoehorn into your effort, but worth a look to see 
if it might be useful, and/or see the C design underlying it. 

Best,
Jeff

Jeffrey Ryan | Founder | jeffrey.r...@lemnica.com

www.lemnica.com

On May 1, 2012, at 1:44 PM, Antonio Piccolboni anto...@piccolboni.info wrote:

 On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
 simon.urba...@r-project.org wrote:
 
 
 On May 1, 2012, at 1:26 PM, Antonio Piccolboni anto...@piccolboni.info
 wrote:
 
 It seems like people need to hear more context, happy to provide it. I am
 implementing a serialization format (typedbytes, HADOOP-1722 if people
 want
 the gory details) to make R and Hadoop interoperate better (RHadoop
 project, package rmr). It is a row first format and it's already
 implemented as a C extension for R for lists and atomic vectors, where
 each
 element  of a vector is a row. I need to extend it to accept data frames
 and I was wondering if I can use the existing C code by converting a data
 frame to a list of its rows. It sounds like the answer is that it is not
 a
 good idea,
 
 Just think about it -- data frames are lists of *columns* because the type
 of each column is fixed. Treating them row-wise is extremely inefficient,
 because you can't use any vector type to represent such thing (other than a
 generic vector containing vectors of length 1).
 
 
 Thanks, let's say this together with the experiments and other converging
 opinions lays the question to rest.
 
 
 that's helpful too in a way because it restricts the options. I
 thought I may be missing a simple primitive, like a t() for data frames
 (that doesn't coerce to matrix).
 
 See above - I think you are misunderstanding data frames - t() makes no
 sense for data frames.
 
 
 I think you are misunderstanding my use of t(). Thanks
 
 
 Antonio
 
 
 
 Cheers,
 Simon
 
 
 
 On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley rip...@stats.ox.ac.uk
 wrote:
 
 On 01/05/2012 00:28, Antonio Piccolboni wrote:
 
 Hi,
 I was wondering if there is anything more efficient than split to do
 the
 kind of conversion in the subject. If I create a data frame as in
 
 system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
 1:2000, sep = ""))})
 user  system elapsed
 0.004   0.000   0.004
 
 and then I try to split it
 
 system.time(split(fd, 1:nrow(fd)))
 
 user  system elapsed
 0.333   0.031   0.415
 
 
 You will be quick to notice the roughly two orders of magnitude
 difference
 in time between creation and conversion. Granted, it's not written
 anywhere
 
 
 Unsurprising when you create three orders of magnitude more data frames,
 is it?  That's a list of 2000 data frames.  Try
 
 system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
 paste0("x", i)))
 
 
 
 that they should be similar but the latter seems interpreter-slow to me
 (split is implemented with a lapply in the data frame case) There is
 also
 a
 memory issue when I hit about 2 elements (allocating 3GB when
 interrupted). So before I resort to Rcpp, despite the electrifying
 feeling
 of approaching the bare metal and for the sake of getting things done,
 I
 thought I would ask the experts. Thanks
 
 
 You need to re-think your data structures: 1-row data frames are not
 sensible.
 
 
 
 
 Antonio
 
 
 
 ______________________________________________
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel
 
 
 
 --
 Brian D. Ripley,  rip...@stats.ox.ac.uk
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UK                Fax:  +44 1865 272595
 
 
 
 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel
 
 
 
 
 
 
 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org 

Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Matthew Dowle

Antonio Piccolboni antonio at piccolboni.info writes:
 Hi,
 I was wondering if there is anything more efficient than split to do the
 kind of conversion in the subject. If I create a data frame as in
 
 system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
 1:2000, sep = ""))})
   user  system elapsed
   0.004   0.000   0.004
 
 and then I try to split it
 
  system.time(split(fd, 1:nrow(fd)))
user  system elapsed
   0.333   0.031   0.415
 
 You will be quick to notice the roughly two orders of magnitude difference
 in time between creation and conversion. Granted, it's not written anywhere
 that they should be similar but the latter seems interpreter-slow to me
 (split is implemented with a lapply in the data frame case) There is also a
 memory issue when I hit about 2 elements (allocating 3GB when
 interrupted). So before I resort to Rcpp, despite the electrifying feeling
 of approaching the bare metal and for the sake of getting things done, I
 thought I would ask the experts. Thanks
 
 Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, 
before r-devel. If you did, please say so.

Answering anyway. Do you really want to split every single row? What's the 
bigger picture? Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying some 
(biased) guesswork, have you seen the data.table package? It doesn't use the 
split-apply-combine paradigm because, as your (extreme) example shows, that 
doesn't scale. When you use the 'by' argument of [.data.table, it allocates 
memory once for the largest group. Then it reuses that same memory for each 
group. That's one reason it's fast and memory efficient at grouping (an order 
of magnitude faster than tapply).

Independent timings :
http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then
DT[, something, by = 1:nrow(DT)]
will give perhaps two orders of magnitude of speedup, but that's an unfair
example because it isn't very realistic. Scaling depends both on the size of
the data.frame and on how finely you want to split it. Your example is extreme
in the latter but not the former; data.table scales in both.
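To make the 'by' mechanics concrete, a small sketch (assuming the data.table package is installed; the table and column names here are made up for illustration):

```r
# Sketch: grouping with data.table's 'by' argument instead of
# split-apply-combine. Requires the data.table CRAN package.
library(data.table)

DT <- data.table(g = rep(1:3, each = 4), v = 1:12)

# Memory for the per-group result is allocated once (sized for the
# largest group) and reused as each group is processed.
DT[, sum(v), by = g]

# The extreme per-row split discussed in this thread, the data.table way:
DT[, v * 2, by = 1:nrow(DT)]
```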

It's nothing to do with the interpreter, btw, just memory usage.

Matthew

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Prof Brian Ripley

On 01/05/2012 00:28, Antonio Piccolboni wrote:

Hi,
I was wondering if there is anything more efficient than split to do the
kind of conversion in the subject. If I create a data frame as in

system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
1:2000, sep = ""))})
   user  system elapsed
   0.004   0.000   0.004

and then I try to split it


system.time(split(fd, 1:nrow(fd)))

user  system elapsed
   0.333   0.031   0.415


You will be quick to notice the roughly two orders of magnitude difference
in time between creation and conversion. Granted, it's not written anywhere


Unsurprising when you create three orders of magnitude more data frames, 
is it?  That's a list of 2000 data frames.  Try


system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
paste0("x", i)))




that they should be similar but the latter seems interpreter-slow to me
(split is implemented with a lapply in the data frame case) There is also a
memory issue when I hit about 2 elements (allocating 3GB when
interrupted). So before I resort to Rcpp, despite the electrifying feeling
of approaching the bare metal and for the sake of getting things done, I
thought I would ask the experts. Thanks


You need to re-think your data structures: 1-row data frames are not 
sensible.






Antonio


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Antonio Piccolboni
It seems like people need to hear more context; happy to provide it. I am
implementing a serialization format (typedbytes, HADOOP-1722 if people want
the gory details) to make R and Hadoop interoperate better (RHadoop project,
package rmr). It is a row-first format and it is already implemented as a C
extension for R for lists and atomic vectors, where each element of a vector
is a row. I need to extend it to accept data frames, and I was wondering if I
could reuse the existing C code by converting a data frame to a list of its
rows. It sounds like the answer is that it is not a good idea; that's helpful
too, in a way, because it restricts the options. I thought I might be missing
a simple primitive, like a t() for data frames (one that doesn't coerce to
matrix). Thanks

Antonio

On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley rip...@stats.ox.ac.uk wrote:

 On 01/05/2012 00:28, Antonio Piccolboni wrote:

 Hi,
 I was wondering if there is anything more efficient than split to do the
 kind of conversion in the subject. If I create a data frame as in

 system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
 1:2000, sep = ""))})
   user  system elapsed
   0.004   0.000   0.004

 and then I try to split it

  system.time(split(fd, 1:nrow(fd)))

user  system elapsed
   0.333   0.031   0.415


 You will be quick to notice the roughly two orders of magnitude difference
 in time between creation and conversion. Granted, it's not written
 anywhere


 Unsurprising when you create three orders of magnitude more data frames,
 is it?  That's a list of 2000 data frames.  Try

 system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
 paste0("x", i)))



  that they should be similar but the latter seems interpreter-slow to me
 (split is implemented with a lapply in the data frame case) There is also
 a
 memory issue when I hit about 2 elements (allocating 3GB when
 interrupted). So before I resort to Rcpp, despite the electrifying feeling
 of approaching the bare metal and for the sake of getting things done, I
 thought I would ask the experts. Thanks


 You need to re-think your data structures: 1-row data frames are not
 sensible.




 Antonio



 ______________________________________________
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel



 --
 Brian D. Ripley,  rip...@stats.ox.ac.uk
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UK                Fax:  +44 1865 272595



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Simon Urbanek

On May 1, 2012, at 1:26 PM, Antonio Piccolboni anto...@piccolboni.info wrote:

 It seems like people need to hear more context, happy to provide it. I am
 implementing a serialization format (typedbytes, HADOOP-1722 if people want
 the gory details) to make R and Hadoop interoperate better (RHadoop
 project, package rmr). It is a row first format and it's already
 implemented as a C extension for R for lists and atomic vectors, where each
 element  of a vector is a row. I need to extend it to accept data frames
 and I was wondering if I can use the existing C code by converting a data
 frame to a list of its rows. It sounds like the answer is that it is not a
 good idea,

Just think about it -- data frames are lists of *columns* because the type of
each column is fixed. Treating them row-wise is extremely inefficient, because
you can't use any vector type to represent such a thing (other than a generic
vector containing vectors of length 1).
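For what it's worth, if that generic-vector-of-length-1-vectors representation is genuinely what is needed, a base-R sketch (the helper name rows_of is made up for illustration) sidesteps split.data.frame's per-group overhead:

```r
# Sketch: data.frame -> list of rows, where each row is a named list of
# length-1 elements, without building 1-row data frames the way split() does.
rows_of <- function(df) {
  lapply(seq_len(nrow(df)), function(i) lapply(df, `[[`, i))
}

fd   <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
rows <- rows_of(fd)
rows[[2]]  # list(x = 2L, y = 3.5)
```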


 that's helpful too in a way because it restricts the options. I
 thought I may be missing a simple primitive, like a t() for data frames
 (that doesn't coerce to matrix).

See above - I think you are misunderstanding data frames - t() makes no sense 
for data frames.

Cheers,
Simon



 On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley 
 rip...@stats.ox.ac.uk wrote:
 
 On 01/05/2012 00:28, Antonio Piccolboni wrote:
 
 Hi,
 I was wondering if there is anything more efficient than split to do the
 kind of conversion in the subject. If I create a data frame as in
 
 system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
 1:2000, sep = ""))})
  user  system elapsed
  0.004   0.000   0.004
 
 and then I try to split it
 
 system.time(split(fd, 1:nrow(fd)))
 
   user  system elapsed
  0.333   0.031   0.415
 
 
 You will be quick to notice the roughly two orders of magnitude difference
 in time between creation and conversion. Granted, it's not written
 anywhere
 
 
 Unsurprising when you create three orders of magnitude more data frames,
 is it?  That's a list of 2000 data frames.  Try
 
 system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
 paste0("x", i)))
 
 
 
 that they should be similar but the latter seems interpreter-slow to me
 (split is implemented with a lapply in the data frame case) There is also
 a
 memory issue when I hit about 2 elements (allocating 3GB when
 interrupted). So before I resort to Rcpp, despite the electrifying feeling
 of approaching the bare metal and for the sake of getting things done, I
 thought I would ask the experts. Thanks
 
 
 You need to re-think your data structures: 1-row data frames are not
 sensible.
 
 
 
 
 Antonio
 
 
 
 ______________________________________________
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel
 
 
 
 --
 Brian D. Ripley,  rip...@stats.ox.ac.uk
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UK                Fax:  +44 1865 272595
 
 
 
 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel
 
 

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-01 Thread Antonio Piccolboni
On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
simon.urba...@r-project.org wrote:


 On May 1, 2012, at 1:26 PM, Antonio Piccolboni anto...@piccolboni.info
 wrote:

  It seems like people need to hear more context, happy to provide it. I am
  implementing a serialization format (typedbytes, HADOOP-1722 if people
 want
  the gory details) to make R and Hadoop interoperate better (RHadoop
  project, package rmr). It is a row first format and it's already
  implemented as a C extension for R for lists and atomic vectors, where
 each
  element  of a vector is a row. I need to extend it to accept data frames
  and I was wondering if I can use the existing C code by converting a data
  frame to a list of its rows. It sounds like the answer is that it is not
 a
  good idea,

 Just think about it -- data frames are lists of *columns* because the type
 of each column is fixed. Treating them row-wise is extremely inefficient,
 because you can't use any vector type to represent such thing (other than a
 generic vector containing vectors of length 1).


Thanks; let's say this, together with the experiments and other converging
opinions, lays the question to rest.


   that's helpful too in a way because it restricts the options. I
  thought I may be missing a simple primitive, like a t() for data frames
  (that doesn't coerce to matrix).

 See above - I think you are misunderstanding data frames - t() makes no
 sense for data frames.


I think you are misunderstanding my use of t(). Thanks


Antonio



 Cheers,
 Simon



  On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley rip...@stats.ox.ac.uk
 wrote:
 
  On 01/05/2012 00:28, Antonio Piccolboni wrote:
 
  Hi,
  I was wondering if there is anything more efficient than split to do
 the
  kind of conversion in the subject. If I create a data frame as in
 
  system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
  1:2000, sep = ""))})
   user  system elapsed
   0.004   0.000   0.004
 
  and then I try to split it
 
  system.time(split(fd, 1:nrow(fd)))
 
user  system elapsed
   0.333   0.031   0.415
 
 
  You will be quick to notice the roughly two orders of magnitude
 difference
  in time between creation and conversion. Granted, it's not written
  anywhere
 
 
  Unsurprising when you create three orders of magnitude more data frames,
  is it?  That's a list of 2000 data frames.  Try
 
  system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
  paste0("x", i)))
 
 
 
  that they should be similar but the latter seems interpreter-slow to me
  (split is implemented with a lapply in the data frame case) There is
 also
  a
  memory issue when I hit about 2 elements (allocating 3GB when
  interrupted). So before I resort to Rcpp, despite the electrifying
 feeling
  of approaching the bare metal and for the sake of getting things done,
 I
  thought I would ask the experts. Thanks
 
 
  You need to re-think your data structures: 1-row data frames are not
  sensible.
 
 
 
 
  Antonio
 
 
 
  ______________________________________________
  R-devel@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-devel
 
 
 
  --
  Brian D. Ripley,  rip...@stats.ox.ac.uk
  Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
  University of Oxford, Tel:  +44 1865 272861 (self)
  1 South Parks Road, +44 1865 272866 (PA)
  Oxford OX1 3TG, UK                Fax:  +44 1865 272595
 
 
 
  __
  R-devel@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-devel
 
 




__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-04-30 Thread Antonio Piccolboni
Hi,
I was wondering if there is anything more efficient than split to do the
kind of conversion in the subject. If I create a data frame as in

system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
1:2000, sep = ""))})
  user  system elapsed
  0.004   0.000   0.004

and then I try to split it

 system.time(split(fd, 1:nrow(fd)))
   user  system elapsed
  0.333   0.031   0.415


You will be quick to notice the roughly two orders of magnitude difference
in time between creation and conversion. Granted, it's not written anywhere
that they should be similar, but the latter seems interpreter-slow to me
(split is implemented with an lapply in the data frame case). There is also a
memory issue when I hit about 2 elements (allocating 3GB when
interrupted). So before I resort to Rcpp, despite the electrifying feeling of
approaching the bare metal, and for the sake of getting things done, I
thought I would ask the experts. Thanks


Antonio


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel