Re: [R] slow computation of functions over large datasets

2011-08-04 Thread Paul Hiemstra
 Hi all,

After reading this interesting discussion I delved a bit deeper into the
subject matter. The following snippet of code (see the end of my mail)
compares three ways of performing this task, using ddply, ave and one
yet unmentioned option: data.table (a package). The piece of code
generates mock datasets which vary in size and number of factor levels
for the factor. The results look like this (there is also a ggplot plot
in the script that summarises the table):

> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that the option of using data.table is by far the fastest of
the three and scales quite nicely with the number of factor levels, in
contrast to ddply. It is also faster than ave by up to a factor of 10.

cheers,
Paul

library(plyr)      # ddply
library(reshape)   # melt
library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(1e5, 1e6, 1e7)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 1e4)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
      cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise,
      val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
    facet_wrap(~ datsize) + geom_line()

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C 
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C   
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C  

attached base packages:
[1] grid  stats graphics  grDevices utils datasets  methods 
[8] base

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4  
[5] plyr_1.5.2   fortunes_1.4-1 

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0


On 08/03/2011 01:25 PM, Caroline Faisst wrote:
 Hello there,


 I'm computing the total value of an order from the price of the order items
 using a for loop and the ifelse function. I do this on a large dataframe
 (close to 1m lines). The computation of this function is painfully slow: in
 1min only about 90 rows are calculated.


 The computation time taken for a given number of rows increases with the
 size of the dataset, see the example with my function below:


 # small dataset: function performs well

 exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                           itemPrice = c(10,17,9,12,25,10,1,9,7))

 exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]

 system.time(for (i in 2:length(exampledata[,1]))
   {exampledata[i, "orderAmount"] <- ifelse(
     exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
     exampledata[i, "itemPrice"])})


 # large dataset: the very same computational task takes much longer

 exampledata2 <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4,5:200),
                            itemPrice = c(10,17,9,12,25,10,1,9,7,25:220))

 exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]

 system.time(for (i in 2:9)
   {exampledata2[i, "orderAmount"] <- ifelse(
     exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
     exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
     exampledata2[i, "itemPrice"])})



 Does someone know a way to increase the speed?


 Thank you very much!

 Caroline




 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


-- 
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770




Re: [R] slow computation of functions over large datasets

2011-08-03 Thread ONKELINX, Thierry
Dear Caroline,

Here is a faster and more elegant solution.

> n <- 10000   # exact value garbled in the archive; order of magnitude assumed
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> library(plyr)
> system.time({
+   ddply(exampledata, .(orderID), function(x){
+     data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+   })
+ })
   user  system elapsed 
   1.67    0.00    1.69 
> exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
> system.time(for (i in 2:length(exampledata[,1]))
+   {exampledata[i, "orderAmount"] <- ifelse(
+     exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
+     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
+     exampledata[i, "itemPrice"])})
   user  system elapsed 
  11.94    0.02   11.97

Best regards,

Thierry
 -----Original Message-----
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
 On behalf of: Caroline Faisst
 Sent: Wednesday, 3 August 2011 15:26
 To: r-help@r-project.org
 Subject: [R] slow computation of functions over large datasets
 



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread David Winsemius


On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:


Hello there,


I’m computing the total value of an order from the price of the  
order items

using a “for” loop and the “ifelse” function.


Ouch. Schools really should stop teaching SAS and BASIC as a first  
language.



I do this on a large dataframe
(close to 1m lines). The computation of this function is painfully  
slow: in

1min only about 90 rows are calculated.


The computation time taken for a given number of rows increases with  
the

size of the dataset, see the example with my function below:


# small dataset: function performs well

exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))

exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]

system.time(for (i in 2:length(exampledata[,1]))
  {exampledata[i, "orderAmount"] <- ifelse(
    exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
    exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])})


Try instead using 'ave' to calculate a cumulative 'sum' within  
orderID:


exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))


I assure you this will be more reproducible,  faster, and  
understandable.
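
As a quick check on the OP's nine-row example, the ave() call produces a running total aligned row-for-row with the input (this worked snippet is mine, not from David's mail):

```r
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))

# cumulative sum within each orderID, in the original row order
exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmt
# 10 27 36 12 37 10 11 20  7
```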



# large dataset:


medium dataset really. Barely nudges the RAM dial on my machine.


the very same computational task takes much longer

exampledata2 <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4,5:200),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:220))

exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]

system.time(for (i in 2:9)
  {exampledata2[i, "orderAmount"] <- ifelse(
    exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
    exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
    exampledata2[i, "itemPrice"])})



> system.time(exampledata2$orderAmt <- with(exampledata2,
+                 ave(itemPrice, orderID, FUN = cumsum)))
   user  system elapsed
 35.106   0.811  35.822

On a three-year-old machine. Not as fast as I expected, but not long
enough to require refilling the coffee cup either.


--
David.


Does someone know a way to increase the speed?



--

David Winsemius, MD
West Hartford, CT



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread David Winsemius


On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:


Dear Caroline,

Here is a faster and more elegant solution.


n <- 10000   # exact value garbled in the archive; order of magnitude assumed
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
                          itemPrice = rpois(n, 10))

library(plyr)
system.time({
  ddply(exampledata, .(orderID), function(x){
    data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
  })
})
   user  system elapsed
   1.67    0.00    1.69

exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1]))
  {exampledata[i, "orderAmount"] <- ifelse(
    exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
    exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])})
   user  system elapsed
  11.94    0.02   11.97


I tried running this method on the large dataset (2MM row) the OP  
offered, and needed to eventually interrupt it so I could get my  
console back:


> system.time({
+   ddply(exampledata2, .(orderID), function(x){
+     data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+   })
+ })

Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.

--
david.



Best regards,

Thierry

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
On behalf of: Caroline Faisst
Sent: Wednesday, 3 August 2011 15:26
To: r-help@r-project.org
Subject: [R] slow computation of functions over large datasets



David Winsemius, MD
West Hartford, CT



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread jim holtman
This takes about 2 secs for 1M rows:

> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> require(data.table)
> # convert to data.table
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[
+     , list(total = sum(itemPrice))
+     , by = orderID
+ ])
   user  system elapsed
   1.30    0.05    1.34

> str(result)
Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
> head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76



On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
caroline.fai...@gmail.com wrote:





-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread David Winsemius


On Aug 3, 2011, at 12:20 PM, jim holtman wrote:


This takes about 2 secs for 1M rows:


n <- 1000000
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
                          itemPrice = rpois(n, 10))

require(data.table)
# convert to data.table
ed.dt <- data.table(exampledata)
system.time(result <- ed.dt[
    , list(total = sum(itemPrice))
    , by = orderID
])
   user  system elapsed
   1.30    0.05    1.34


Interesting. Impressive. And I noted that the OP wanted what cumsum  
would provide and for some reason creating that longer result is even  
faster on my machine than the shorter result using sum.
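
The cumulative-sum variant David alludes to is not shown in his mail; a sketch of what it presumably looked like, reusing Jim's ed.dt (the name result2 is mine):

```r
# requires Jim's ed.dt from above (orderID, itemPrice columns)
system.time(result2 <- ed.dt[
    , list(itemPrice, orderAmount = cumsum(itemPrice))
    , by = orderID
])
# result2 keeps one row per item, with the running total per orderID
```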


--
David.


str(result)
Classes ‘data.table’ and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...

head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76





On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
caroline.fai...@gmail.com wrote:









David Winsemius, MD
West Hartford, CT



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread David Winsemius


On Aug 3, 2011, at 2:01 PM, Ken wrote:


Hello,
 Perhaps transpose the table attach(as.data.frame(t(data))) and use  
ColSums() function with order id as header.

-Ken Hutchison


 Got any code? The OP offered a reproducible example, after all.

--
David.


On Aug 3, 2554 BE, at 1:12 PM, David Winsemius  
dwinsem...@comcast.net wrote:






David Winsemius, MD
West Hartford, CT



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread Ken
Sorry about the lack of code, but using David's example, would

    tapply(itemPrice, INDEX = orderID, FUN = sum)

work?
  -Ken Hutchison

On Aug 3, 2554 BE, at 2:09 PM, David Winsemius dwinsem...@comcast.net wrote:

 



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread Ken
Hello, 
  Perhaps transpose the table, attach(as.data.frame(t(data))), and use the
colSums() function with order ID as header.
 -Ken Hutchison

On Aug 3, 2554 BE, at 1:12 PM, David Winsemius dwinsem...@comcast.net wrote:

 



Re: [R] slow computation of functions over large datasets

2011-08-03 Thread David Winsemius


On Aug 3, 2011, at 3:05 PM, Ken wrote:


Sorry about the lack of code, but using Davids example, would:
tapply(itemPrice, INDEX=orderID, FUN=sum)
work?


Doesn't do the cumulative sums or the assignment into a column of the
same data.frame. That's why I used ave: it keeps the sequence correct.
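
The difference is easy to see on the OP's nine-row example (a sketch combining Ken's tapply call with the earlier ave approach; neither output appears in the original mails):

```r
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))

# tapply collapses to one total per orderID, losing row-by-row alignment
with(exampledata, tapply(itemPrice, orderID, sum))
#  1  2  3  4
# 36 37 20  7

# ave returns a vector as long as the input, so it can be assigned
# straight back into the data.frame
with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
# 10 27 36 12 37 10 11 20  7
```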


--
David.

 -Ken Hutchison




David Winsemius, MD
West Hartford, CT
