Hey All, I think the old thread is here: https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J

The method proposed in that thread is to create a utility class for doing single-pass aggregations. Using Algebird is a pretty good way to do this and is a bit more flexible, since you don't need to create a new utility each time you want to do this.

In Spark 1.0 and later you will be able to do this more elegantly with the schema support:

    myRDD.groupBy('user).select(Sum('clicks) as 'clicks, Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite released yet :)

- Patrick

On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i currently typically do something like this:
>
> scala> val rdd = sc.parallelize(1 to 10)
> scala> import com.twitter.algebird.Operators._
> scala> import com.twitter.algebird.{Max, Min}
> scala> rdd.map{ x => (
>      |   1L,
>      |   Min(x),
>      |   Max(x),
>      |   x
>      | )}.reduce(_ + _)
> res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)
>
> however for this you need twitter algebird dependency. without that you have
> to code the reduce function on the tuples yourself...
>
> another example with 2 columns, where i do conditional count for first
> column, and simple sum for second:
>
> scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
>      |   if (x > 5) 1 else 0,
>      |   y
>      | )}.reduce(_ + _)
> res3: (Int, Int) = (5,155)
>
> On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>>
>> Hi Koert, Patrick,
>>
>> do you already have an elegant solution to combine multiple operations on
>> a single RDD?
>> Say for example that I want to do a sum over one column, a count and an
>> average over another column,
>>
>> thanks in advance,
>> Richard
>>
>> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rsiebel...@gmail.com>
>> wrote:
>>>
>>> Patrick, Koert,
>>>
>>> I'm also very interested in these examples, could you please post them if
>>> you find them?
>>>
>>> thanks in advance,
>>> Richard
>>>
>>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> not that long ago there was a nice example on here about how to combine
>>>> multiple operations on a single RDD. so basically if you want to do a
>>>> count() and something else, how to roll them into a single job. i think
>>>> patrick wendell gave the examples.
>>>>
>>>> i cant find them anymore.... patrick can you please repost? thanks!
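
For the case Koert mentions, where you don't want the Algebird dependency and have to code the combine step yourself, something along these lines might work. This is only a rough sketch: the `Stats` case class and its `zero`, `add`, and `merge` members are hypothetical names made up for illustration, not anything from Spark or Algebird, and the demo folds over a local range split in two to stand in for partitions so it runs without a cluster.

```scala
// Hypothetical helper for illustration; not part of Spark or Algebird.
final case class Stats(count: Long, min: Int, max: Int, sum: Long) {
  def mean: Double = if (count == 0) 0.0 else sum.toDouble / count
}

object Stats {
  // Identity element: merging with zero leaves the other side unchanged.
  val zero: Stats = Stats(0L, Int.MaxValue, Int.MinValue, 0L)

  // Fold one element into a partial result (the per-partition step).
  def add(s: Stats, x: Int): Stats =
    Stats(s.count + 1, math.min(s.min, x), math.max(s.max, x), s.sum + x)

  // Merge two partial results (the cross-partition step).
  def merge(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, math.min(a.min, b.min), math.max(a.max, b.max), a.sum + b.sum)
}

object SinglePassDemo {
  def main(args: Array[String]): Unit = {
    // Two local chunks stand in for two partitions of an RDD[Int].
    val (left, right) = (1 to 10).splitAt(5)
    val result = Stats.merge(
      left.foldLeft(Stats.zero)(Stats.add),
      right.foldLeft(Stats.zero)(Stats.add)
    )
    println(result)      // Stats(10,1,10,55)
    println(result.mean) // 5.5
  }
}
```

On a real RDD[Int] the same two functions should plug straight into `rdd.aggregate(Stats.zero)(Stats.add, Stats.merge)`, which computes count, min, max, and sum in one job. The average then falls out of sum and count, so it doesn't need its own pass.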