Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
A bit late and possibly tangential: the mmap package has something called struct(), which is really a row-wise array of heterogeneous columns. As Simon and others have pointed out, R has no way to handle this natively, but mmap does provide a very measurable performance gain by orienting rows together in memory (mapped memory, to be specific). Since it all lives outside of R, so to speak, mmap even supports many non-native types, from bit vectors to 64-bit ints (with conversion caveats applicable). example(struct) shows some performance gains with this approach. There are even some crude methods to convert data.frames as-is to mmap struct objects directly (hint: as.mmap).

Again, likely not enough to shoehorn into your effort, but worth a look to see if it might be useful, and/or to see the C design underlying it.

Best,
Jeff

Jeffrey Ryan | Founder | jeffrey.r...@lemnica.com
www.lemnica.com

On May 1, 2012, at 1:44 PM, Antonio Piccolboni anto...@piccolboni.info wrote:

> On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek simon.urba...@r-project.org wrote:
>
>> Just think about it -- data frames are lists of *columns*, because the type
>> of each column is fixed. Treating them row-wise is extremely inefficient,
>> because you can't use any vector type to represent such a thing (other than
>> a generic vector containing vectors of length 1).
>
> Thanks; let's say this, together with the experiments and other converging
> opinions, lays the question to rest. That's helpful too, in a way, because
> it restricts the options.
>
>> See above - I think you are misunderstanding data frames - t() makes no
>> sense for data frames.
>
> I think you are misunderstanding my use of t().
>
> Thanks
>
> Antonio

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Antonio Piccolboni antonio at piccolboni.info writes:

> Hi, I was wondering if there is anything more efficient than split to do
> the kind of conversion in the subject. If I create a data frame as in
>
>     system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
>                                  id = paste(x, 1:2000, sep = ""))})
>        user  system elapsed
>       0.004   0.000   0.004
>
> and then I try to split it
>
>     system.time(split(fd, 1:nrow(fd)))
>        user  system elapsed
>       0.333   0.031   0.415
>
> you will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere
> that they should be similar, but the latter seems interpreter-slow to me
> (split is implemented with a lapply in the data frame case). There is also
> a memory issue when I hit about 2 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal, and for the sake of getting things done, I
> thought I would ask the experts. Thanks
>
> Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, before r-devel. If you did, please say so. Answering anyway.

Do you really want to split every single row? What's the bigger picture? Perhaps you don't need to split at all. On the off chance that the example was just for exposition, and applying some (biased) guesswork: have you seen the data.table package? It doesn't use the split-apply-combine paradigm because, as your (extreme) example shows, that doesn't scale. When you use the 'by' argument of [.data.table, it allocates memory once for the largest group, then reuses that same memory for each group. That's one reason it's fast and memory-efficient at grouping (an order of magnitude faster than tapply).

Independent timings: http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then DT[, something, by = 1:nrow(DT)] will give perhaps two orders of magnitude of speedup, but that's an unfair example because it isn't very realistic. Scaling applies both to the size of the data.frame and to how much you want to split it up. Your example is extreme in the latter but not the former; data.table scales in both.

It's nothing to do with the interpreter, btw, just memory usage.

Matthew
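[Editorially added example.] A minimal sketch of the DT[, something, by = 1:nrow(DT)] idiom described above; not from the original post, it assumes the data.table package is installed, and the ysq column is just a placeholder computation:

```r
library(data.table)  # assumed installed

DT <- data.table(x = 1:2000, y = rnorm(2000))

# split-apply route: materializes 2000 one-row data.frames
t_split <- system.time(s <- split(as.data.frame(DT), seq_len(nrow(DT))))

# data.table route: group by row number; working memory is allocated
# once and reused across groups
t_dt <- system.time(r <- DT[, list(ysq = y^2), by = 1:nrow(DT)])

# compare t_split["elapsed"] with t_dt["elapsed"] on your machine
```

Timings will vary by machine, which is why none are quoted here.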
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
On 01/05/2012 00:28, Antonio Piccolboni wrote:

> Hi, I was wondering if there is anything more efficient than split to do
> the kind of conversion in the subject. If I create a data frame as in
>
>     system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
>                                  id = paste(x, 1:2000, sep = ""))})
>        user  system elapsed
>       0.004   0.000   0.004
>
> and then I try to split it
>
>     system.time(split(fd, 1:nrow(fd)))
>        user  system elapsed
>       0.333   0.031   0.415
>
> you will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere

Unsurprising when you create three orders of magnitude more data frames, is it? That's a list of 2000 data frames. Try

    system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = paste0(x, i)))

> that they should be similar, but the latter seems interpreter-slow to me
> (split is implemented with a lapply in the data frame case). There is also
> a memory issue when I hit about 2 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal, and for the sake of getting things done, I
> thought I would ask the experts. Thanks

You need to re-think your data structures: 1-row data frames are not sensible.

> Antonio

-- 
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
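[Editorially added example.] The overhead Prof. Ripley describes is easy to see in base R; a small sketch, not part of the original message:

```r
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste(1:2000, 1:2000, sep = ""))

# split(fd, 1:nrow(fd)) builds 2000 one-row data.frames, each carrying
# full data.frame overhead (names, row.names, class attribute), so the
# cost is dominated by object construction, not by the interpreter.
rows <- split(fd, seq_len(nrow(fd)))

length(rows)     # 2000 separate data.frames
nrow(rows[[1]])  # each holds a single row
```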
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
It seems like people need to hear more context; happy to provide it. I am implementing a serialization format (typedbytes, HADOOP-1722 if people want the gory details) to make R and Hadoop interoperate better (RHadoop project, package rmr). It is a row-first format, and it's already implemented as a C extension for R for lists and atomic vectors, where each element of a vector is a row. I need to extend it to accept data frames, and I was wondering if I could reuse the existing C code by converting a data frame to a list of its rows. It sounds like the answer is that it is not a good idea; that's helpful too, in a way, because it restricts the options. I thought I might be missing a simple primitive, like a t() for data frames (one that doesn't coerce to matrix).

Thanks

Antonio

On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley rip...@stats.ox.ac.uk wrote:

> Unsurprising when you create three orders of magnitude more data frames, is
> it? That's a list of 2000 data frames. Try
>
>     system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = paste0(x, i)))
>
> You need to re-think your data structures: 1-row data frames are not
> sensible.
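[Editorially added example.] One way to emit rows for a row-first format without ever materializing one-row data frames is to walk the columns in lock-step; a base-R sketch, where serialize_row is a hypothetical stand-in for the real typedbytes writer:

```r
fd <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5),
                 id = c("a", "b", "c"), stringsAsFactors = FALSE)

# Hypothetical writer: here each "row" is just a plain named list of
# the i-th element of every column.
serialize_row <- function(...) list(...)

# .mapply (base R) iterates all columns in parallel, so no one-row
# data.frames are ever constructed.
rows <- .mapply(serialize_row, fd, NULL)
```

The real C extension would replace serialize_row with the typedbytes encoder, but the iteration pattern is the same.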
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
On May 1, 2012, at 1:26 PM, Antonio Piccolboni anto...@piccolboni.info wrote:

> It seems like people need to hear more context; happy to provide it. I am
> implementing a serialization format (typedbytes, HADOOP-1722 if people want
> the gory details) to make R and Hadoop interoperate better (RHadoop
> project, package rmr). It is a row-first format, and it's already
> implemented as a C extension for R for lists and atomic vectors, where each
> element of a vector is a row. I need to extend it to accept data frames,
> and I was wondering if I could reuse the existing C code by converting a
> data frame to a list of its rows. It sounds like the answer is that it is
> not a good idea,

Just think about it -- data frames are lists of *columns*, because the type of each column is fixed. Treating them row-wise is extremely inefficient, because you can't use any vector type to represent such a thing (other than a generic vector containing vectors of length 1).

> that's helpful too, in a way, because it restricts the options. I thought I
> might be missing a simple primitive, like a t() for data frames (one that
> doesn't coerce to matrix).

See above - I think you are misunderstanding data frames - t() makes no sense for data frames.

Cheers,
Simon
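[Editorially added example.] Simon's column-wise point can be seen directly in base R; an illustrative sketch, not from the thread:

```r
df <- data.frame(a = 1:2, b = c(1.5, 2.5), ch = c("x", "y"),
                 stringsAsFactors = FALSE)

# Column-wise: three typed vectors, no per-value boxing.
cols <- as.list(df)

# Row-wise: the only faithful representation is a generic vector (a
# list) per row, each element a length-1 vector of a different type.
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))
```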
Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek simon.urba...@r-project.org wrote:

> Just think about it -- data frames are lists of *columns*, because the type
> of each column is fixed. Treating them row-wise is extremely inefficient,
> because you can't use any vector type to represent such a thing (other than
> a generic vector containing vectors of length 1).

Thanks; let's say this, together with the experiments and other converging opinions, lays the question to rest.

> See above - I think you are misunderstanding data frames - t() makes no
> sense for data frames.

I think you are misunderstanding my use of t().

Thanks

Antonio
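[Editorially added example.] For readers wondering why a literal t() was never an option: on a mixed-type data frame it goes through as.matrix(), coercing every value to a common type; a sketch, not from the thread:

```r
df <- data.frame(x = 1:2, id = c("a", "b"), stringsAsFactors = FALSE)

# t.data.frame() calls as.matrix() first, so the integer column is
# coerced to character along with everything else.
m <- t(df)
```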
[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Hi, I was wondering if there is anything more efficient than split to do the kind of conversion in the subject. If I create a data frame as in

    system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
                                 id = paste(x, 1:2000, sep = ""))})
       user  system elapsed
      0.004   0.000   0.004

and then I try to split it

    system.time(split(fd, 1:nrow(fd)))
       user  system elapsed
      0.333   0.031   0.415

you will be quick to notice the roughly two orders of magnitude difference in time between creation and conversion. Granted, it's not written anywhere that they should be similar, but the latter seems interpreter-slow to me (split is implemented with a lapply in the data frame case). There is also a memory issue when I hit about 2 elements (allocating 3GB when interrupted). So before I resort to Rcpp, despite the electrifying feeling of approaching the bare metal, and for the sake of getting things done, I thought I would ask the experts. Thanks

Antonio