Re: [R] slow computation of functions over large datasets
Re: [R] slow computation of functions over large datasets

Hi all,

After reading this interesting discussion I delved a bit deeper into the subject. The snippet of code at the end of my mail compares three ways of performing this task, using ddply, ave, and one as-yet-unmentioned option: the data.table package. The code generates mock datasets that vary in size and in the number of levels of the grouping factor. The results look like this (the script also produces a ggplot figure summarising the table):

res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that data.table is by far the fastest of the three and that it scales quite nicely with the number of factor levels, in contrast to ddply. It is also faster than ave, by up to a factor of ten.
cheers,
Paul

library(ggplot2)
library(plyr)
library(data.table)
theme_set(theme_bw())

datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()

sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4
[5] plyr_1.5.2       fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0

On 08/03/2011 01:25 PM, Caroline Faisst wrote:

Hello there,

I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated.
The computation time taken for a given number of rows increases with the size of the dataset; see the example with my function below:

# small dataset: function performs well
exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1]))
  {exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
     exampledata[i, "itemPrice"])})

# large dataset: the very same computational task takes much longer
exampledata2 <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4,5:200),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:220))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9)
  {exampledata2[i, "orderAmount"] <- ifelse(exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
     exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
     exampledata2[i, "itemPrice"])})

Does someone know a way to increase the speed? Thank you very much!

Caroline

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
Re: [R] slow computation of functions over large datasets
Dear Caroline,

Here is a faster and more elegant solution.

n <- 1
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
                          itemPrice = rpois(n, 10))
library(plyr)
system.time({
  ddply(exampledata, .(orderID), function(x){
    data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
  })
})
   user  system elapsed
   1.67    0.00    1.69

exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1]))
  {exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
     exampledata[i, "itemPrice"])})
   user  system elapsed
  11.94    0.02   11.97

Best regards,
Thierry

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On behalf of Caroline Faisst
Sent: Wednesday, 3 August 2011 15:26
To: r-help@r-project.org
Subject: [R] slow computation of functions over large datasets

Hello there,

I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated.
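Thierry's ddply call is an instance of the split-apply-combine idea: split the rows by orderID, take a cumulative sum within each piece, and stitch the pieces back together. The same per-order running total can be sketched without the plyr dependency using base R's split() and unsplit() (a minimal illustration on the OP's small example, not Thierry's exact code):

```r
# Per-order running totals via base-R split/apply/combine.
# Mirrors the ddply(..., cumsum) idea without the plyr dependency.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
f <- factor(exampledata$orderID)
pieces  <- split(exampledata$itemPrice, f)      # one price vector per order
running <- lapply(pieces, cumsum)               # cumulative sum within each order
exampledata$orderAmount <- unsplit(running, f)  # stitch back in original row order
exampledata$orderAmount
# 10 27 36 12 37 10 11 20  7
```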
Re: [R] slow computation of functions over large datasets
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

Hello there, I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function.

Ouch. Schools really should stop teaching SAS and BASIC as a first language.

I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated. The computation time taken for a given number of rows increases with the size of the dataset, see the example with my function below:

# small dataset: function performs well
exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1]))
  {exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
     exampledata[i, "itemPrice"])})

Try instead using 'ave' to calculate a cumulative 'sum' within orderID:

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))

I assure you this will be more reproducible, faster, and understandable.

# large dataset:

Medium dataset, really. Barely nudges the RAM dial on my machine.

the very same computational task takes much longer

exampledata2 <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4,5:200),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:220))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9)
  {exampledata2[i, "orderAmount"] <- ifelse(exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
     exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
     exampledata2[i, "itemPrice"])})

system.time(exampledata2$orderAmt <- with(exampledata2, ave(itemPrice, orderID, FUN = cumsum)))
   user  system elapsed
 35.106   0.811  35.822

On a three-year-old machine. Not as fast as I expected, but not long enough to require refilling the coffee cup either.

-- David.

Does someone know a way to increase the speed?
--
David Winsemius, MD
West Hartford, CT
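Applied to the OP's small example, David's ave() one-liner produces the per-row running totals directly; a minimal sketch (the expected values are worked out by hand from the OP's data):

```r
# ave() applies FUN within each group and returns a vector aligned with the rows,
# so the per-order running total can be assigned straight back as a new column.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
exampledata$orderAmount <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmount
# 10 27 36 12 37 10 11 20  7
```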
Re: [R] slow computation of functions over large datasets
On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:

Dear Caroline, Here is a faster and more elegant solution. [ddply solution and timings snipped]

I tried running this method on the large dataset (2MM row) the OP offered, and needed to eventually interrupt it so I could get my console back:

system.time({
  ddply(exampledata2, .(orderID), function(x){
    data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
  })
})
Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.

-- david.
David Winsemius, MD
West Hartford, CT
Re: [R] slow computation of functions over large datasets
This takes about 2 secs for 1M rows:

n <- 1000000
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
                          itemPrice = rpois(n, 10))
require(data.table)
# convert to data.table
ed.dt <- data.table(exampledata)
system.time(result <- ed.dt[
  , list(total = sum(itemPrice))
  , by = orderID
])
   user  system elapsed
   1.30    0.05    1.34

str(result)
Classes 'data.table' and 'data.frame':  198708 obs. of 2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...

head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76

On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst caroline.fai...@gmail.com wrote:

Hello there, I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated. [rest of original message snipped]
--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
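Jim's data.table approach, shrunk to the OP's small example, looks like this (a sketch assuming the data.table package is installed; the `[, j, by]` grouping syntax matches the usage in his post):

```r
library(data.table)  # assumes data.table is installed

exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
ed.dt <- data.table(exampledata)
# one total per orderID, computed group-wise inside [ ]
result <- ed.dt[, list(total = sum(itemPrice)), by = orderID]
result$total
# 36 37 20 7
```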
Re: [R] slow computation of functions over large datasets
On Aug 3, 2011, at 12:20 PM, jim holtman wrote:

This takes about 2 secs for 1M rows: [data.table solution snipped]

Interesting. Impressive. And I noted that the OP wanted what cumsum would provide, and for some reason creating that longer result is even faster on my machine than the shorter result using sum.

-- David.
David Winsemius, MD
West Hartford, CT
Re: [R] slow computation of functions over large datasets
On Aug 3, 2011, at 2:01 PM, Ken wrote:

Hello, Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with the order ID as header.
-Ken Hutchison

Got any code? The OP offered a reproducible example, after all.

-- David.
David Winsemius, MD
West Hartford, CT
Re: [R] slow computation of functions over large datasets
Sorry about the lack of code, but using David's example, would

tapply(itemPrice, INDEX = orderID, FUN = sum)

work?

-Ken Hutchison

On Aug 3, 2554 BE, at 2:09 PM, David Winsemius dwinsem...@comcast.net wrote: [earlier messages snipped]
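Ken's tapply() suggestion does produce per-order totals, though as a named vector with one value per order rather than a running total aligned with the rows; a minimal sketch on the OP's data:

```r
orderID   <- c(1, 1, 1, 2, 2, 3, 3, 3, 4)
itemPrice <- c(10, 17, 9, 12, 25, 10, 1, 9, 7)
# tapply collapses to one summary value per group (a named vector),
# not a per-row column like ave() returns
totals <- tapply(itemPrice, INDEX = orderID, FUN = sum)
totals
#  1  2  3  4
# 36 37 20  7
```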
Re: [R] slow computation of functions over large datasets
Hello, Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with the order ID as header.

-Ken Hutchison

On Aug 3, 2554 BE, at 1:12 PM, David Winsemius dwinsem...@comcast.net wrote: [earlier messages snipped]
Re: [R] slow computation of functions over large datasets
On Aug 3, 2011, at 3:05 PM, Ken wrote:

Sorry about the lack of code, but using David's example, would tapply(itemPrice, INDEX = orderID, FUN = sum) work?

Doesn't do the cumulative sums or the assignment into a column of the same data.frame. That's why I used ave: it keeps the sequence correct.

-- David.

-Ken Hutchison

On Aug 3, 2554 BE, at 2:09 PM, David Winsemius dwinsem...@comcast.net wrote: [earlier messages snipped]
David Winsemius, MD
West Hartford, CT