I typically do something like this, which computes count, min, max and sum in a single job:

scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min}
scala> rdd.map{ x => (
     |   1L,
     |   Min(x),
     |   Max(x),
     |   x
     | )}.reduce(_ + _)
res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)

However, this needs the Twitter Algebird dependency. Without it you have to code the reduce function on the tuples yourself...

Another example with 2 columns, where I do a conditional count on the first column and a simple sum on the second (still in the same session, so + on tuples comes from the Algebird import):

scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
     |   if (x > 5) 1 else 0,
     |   y
     | )}.reduce(_ + _)
res3: (Int, Int) = (5,155)

On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebel...@gmail.com> wrote:

> Hi Koert, Patrick,
>
> do you already have an elegant solution to combine multiple operations on
> a single RDD?
> Say for example that I want to do a sum over one column, a count and an
> average over another column,
>
> thanks in advance,
> Richard
>
>
> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling
> <rsiebel...@gmail.com> wrote:
>
>> Patrick, Koert,
>>
>> I'm also very interested in these examples, could you please post them if
>> you find them?
>> thanks in advance,
>> Richard
>>
>>
>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> not that long ago there was a nice example on here about how to combine
>>> multiple operations on a single RDD. so basically if you want to do a
>>> count() and something else, how to roll them into a single job. i think
>>> patrick wendell gave the examples.
>>>
>>> i cant find them anymore.... patrick can you please repost? thanks!
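
For the no-Algebird route mentioned in the reply, here is a rough sketch of what coding the tuple reduce by hand could look like. It runs on plain Scala collections so it needs no SparkContext; the assumption is that the same lift and merge functions could be passed to rdd.map(...).reduce(...) unchanged, since the merges are associative and commutative. All names here (TupleReduce, Stats, lift, merge, liftPair, mergePair, stats, sumCountAvg) are made up for illustration and are not part of Spark or Algebird. The second half covers the case from the question: a sum over one column, and a count and average over another.

```scala
// Hypothetical helper object; none of these names come from Spark or Algebird.
object TupleReduce {
  // Partial aggregates for one column: (count, min, max, sum).
  type Stats = (Long, Int, Int, Int)

  // Turn a single element into an aggregate over just that element.
  def lift(x: Int): Stats = (1L, x, x, x)

  // Merge two partial aggregates slot by slot.
  def merge(a: Stats, b: Stats): Stats =
    (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3), a._4 + b._4)

  // Note: reduce throws on an empty collection.
  def stats(xs: Seq[Int]): Stats =
    xs.map(lift).reduce((a, b) => merge(a, b))

  // Two columns: (sum of first column, count of rows, sum of second column).
  type Partial = (Long, Long, Long)

  def liftPair(p: (Int, Int)): Partial = (p._1.toLong, 1L, p._2.toLong)

  def mergePair(a: Partial, b: Partial): Partial =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Returns (sum of first column, row count, average of second column).
  // The average is derived only at the end, from the summed count and sum.
  def sumCountAvg(pairs: Seq[(Int, Int)]): (Long, Long, Double) = {
    val (sumX, n, sumY) = pairs.map(liftPair).reduce((a, b) => mergePair(a, b))
    (sumX, n, sumY.toDouble / n)
  }

  def main(args: Array[String]): Unit = {
    println(stats(1 to 10))                          // (10,1,10,55)
    println(sumCountAvg((1 to 10).zip(11 to 20)))    // (55,10,15.5)
  }
}
```

The average is carried as a (count, sum) pair and divided only after the reduce, because averages themselves cannot be merged pairwise.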