Re: [R] Faster Subsetting

2016-09-28 Thread Dénes Tóth
Hi Harold, Generally, you cannot beat data.table unless you can represent your data in a matrix (or array or vector). For some specific cases, Hervé's suggestion might also be competitive. Your problem is that you did not put any effort into reading at least part of the very extensive
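A minimal sketch of the keyed data.table lookup this advice points to, reusing the toy 'tmp' object from the original post; the setkey() step and the .(...) join syntax are an assumption about how one would apply data.table here, not code quoted in the message:

  library(data.table)
  tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
  dt <- as.data.table(tmp)
  setkey(dt, id)   # sort once; later lookups use binary search on the key
  dt[.(1L)]        # all rows with id == 1, without scanning every row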

Re: [R] Faster Subsetting

2016-09-28 Thread Martin Morgan
On 09/28/2016 02:53 PM, Hervé Pagès wrote: Hi, I'm surprised nobody suggested split(). Splitting the data.frame upfront is faster than repeatedly subsetting it: tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20)) idList <- unique(tmp$id) system.time(for (i in idList)

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
"I'm surprised nobody suggested split(). " I did. by() is a data frame oriented version of tapply(), which uses split(). Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom

Re: [R] Faster Subsetting

2016-09-28 Thread Hervé Pagès
Hi, I'm surprised nobody suggested split(). Splitting the data.frame upfront is faster than repeatedly subsetting it: tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20)) idList <- unique(tmp$id) system.time(for (i in idList) tmp[which(tmp$id == i),]) # user system
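A sketch of the split() idea from this message: split once, then work on the pieces. The lapply() step stands in for whatever per-id work follows and is not part of the quoted code:

  tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
  chunks <- split(tmp, tmp$id)              # one pass over the data frame
  lapply(chunks, function(d) mean(d$foo))   # then process each piece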

Re: [R] Faster Subsetting

2016-09-28 Thread Weiser, Dr. Constantin
eplicate(500, subset(tmp2, id == idList[1]))) > > From: Dominik Schneider [mailto:dosc3...@colorado.edu] > Sent: Wednesday, September 28, 2016 12:27 PM > To: Doran, Harold <hdo...@air.org> > Cc: r-help@r-project.org > Subject: Re: [R] Faster Subsetting > > I regularly crunch

Re: [R] Faster Subsetting

2016-09-28 Thread Dominik Schneider
I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory. Dominik On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold wrote: > I have an extremely large data frame (~13
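A hedged sketch of the tidyverse pattern alluded to here, using dplyr's grouped operations; mean(foo) is a placeholder statistic, as the actual computation is not specified in the thread:

  library(dplyr)
  tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
  tmp %>%
    group_by(id) %>%                  # grouping replaces explicit subsetting
    summarise(mean_foo = mean(foo))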

Re: [R] Faster Subsetting

2016-09-28 Thread Enrico Schumann
On Wed, 28 Sep 2016, "Doran, Harold" writes: > I have an extremely large data frame (~13 million rows) that resembles > the structure of the object tmp below in the reproducible code. In my > real data, the variable, 'id' may or may not be ordered, but I think > that is

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
each time compared to the indexing method. > > Perhaps I'm using it incorrectly? > > > > -Original Message- > From: Constantin Weiser [mailto:constantin.wei...@hhu.de] > Sent: Wednesday, September 28, 2016 12:55 PM > To: r-help@r-project.org > Cc: Doran, Harold &

Re: [R] Faster Subsetting

2016-09-28 Thread Doran, Harold
compared to the indexing method. Perhaps I'm using it incorrectly? -Original Message- From: Constantin Weiser [mailto:constantin.wei...@hhu.de] Sent: Wednesday, September 28, 2016 12:55 PM To: r-help@r-project.org Cc: Doran, Harold <hdo...@air.org> Subject: Re: [R] Faster Subsett

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
<- as.data.table(tmp) # data.table > > system.time(replicate(500, tmp2[which(tmp$id == idList[1]),])) > > system.time(replicate(500, subset(tmp2, id == idList[1]))) > > From: Dominik Schneider [mailto:dosc3...@colorado.edu] > Sent: Wednesday, September 28, 2016 12:27

Re: [R] Faster Subsetting

2016-09-28 Thread ruipbarradas
Hello, If you work with a matrix instead of a data.frame, it usually runs faster, but your column vectors must all be numeric. ### Fast, but not fast enough system.time(replicate(500, tmp[which(tmp$id == idList[1]),])) user system elapsed 0.05 0.00 0.04 ### Not fast at all,
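A minimal illustration of the matrix route suggested here, assuming (as the message requires) that all columns are numeric:

  tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
  idList <- unique(tmp$id)
  m <- as.matrix(tmp)   # valid because both columns are numeric
  system.time(replicate(500, m[m[, "id"] == idList[1], ]))   # matrix indexing is cheaper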

Re: [R] Faster Subsetting

2016-09-28 Thread Doran, Harold
Schneider [mailto:dosc3...@colorado.edu] Sent: Wednesday, September 28, 2016 12:27 PM To: Doran, Harold <hdo...@air.org> Cc: r-help@r-project.org Subject: Re: [R] Faster Subsetting I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. The

[R] Faster Subsetting

2016-09-28 Thread Doran, Harold
I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant. I have a process that requires subsetting the data by id and then
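The reproducible example referred to above, reconstructed from the fragments quoted elsewhere in the thread (the 500-fold replication appears in later messages; the real data are ~13 million rows, not 20):

  tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20))
  idList <- unique(tmp$id)
  ## repeated subsetting by id -- instant on the toy data, slow at scale
  system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))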