i typically do something like this:

scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min}
scala> rdd.map{ x => (
     |   1L,
     |   Min(x),
     |   Max(x),
     |   x
     | )}.reduce(_ + _)
res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)

however, this needs the twitter algebird dependency. without it you
have to write the reduce function on the tuples yourself...
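
for reference, a hand-written version of that reduce could look roughly like the sketch below. the names Stats and combine are made up for illustration, and a plain scala range stands in for the rdd so the snippet runs without spark; with spark the same function should be usable as rdd.map(...).reduce(combine).

```scala
// Hypothetical helper names (Stats, combine); not part of Spark or Algebird.
// Tuple layout: (count, min, max, sum)
type Stats = (Long, Int, Int, Int)

// Combine two partial results field by field.
def combine(a: Stats, b: Stats): Stats =
  (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3), a._4 + b._4)

// A plain collection stands in for sc.parallelize(1 to 10).
val data = 1 to 10

// Map every element to a one-element summary, then merge the summaries.
val stats: Stats = data.map(x => (1L, x, x, x)).reduce(combine)
```

combine is associative and commutative, which is what a distributed reduce needs, since partial results can be merged in any order.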

another example with 2 columns, where i do a conditional count on the
first column and a simple sum on the second (the + on tuples again comes
from the algebird import above):
scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
     |   if (x > 5) 1 else 0,
     |   y
     | )}.reduce(_ + _)
res3: (Int, Int) = (5,155)
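
the average richard asks about below can probably be folded into the same single pass by carrying a sum and a count and dividing at the end. a rough sketch without algebird follows; the name combine3 is made up for illustration, and a plain scala collection stands in for the rdd so it runs without spark.

```scala
// Hypothetical helper name (combine3); a plain collection stands in for the RDD.
// Rows are (first column, second column).
val rows = (1 to 10).zip(11 to 20)

// Tuple layout: (sum of first column, row count, sum of second column)
def combine3(a: (Long, Long, Long), b: (Long, Long, Long)): (Long, Long, Long) =
  (a._1 + b._1, a._2 + b._2, a._3 + b._3)

// One pass over the data produces all three partial aggregates.
val (sumX, n, sumY) = rows
  .map { case (x, y) => (x.toLong, 1L, y.toLong) }
  .reduce(combine3)

// The average is derived afterwards from the sum and the count.
val avgY = sumY.toDouble / n
```

the average itself is not reduced directly, because averaging partial averages gives the wrong answer when the partitions differ in size.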



On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebel...@gmail.com> wrote:

> Hi Koert, Patrick,
>
> do you already have an elegant solution to combine multiple operations on
> a single RDD?
> Say for example that I want to do a sum over one column, a count and an
> average over another column,
>
> thanks in advance,
> Richard
>
>
> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling
> <rsiebel...@gmail.com> wrote:
>
>> Patrick, Koert,
>>
>> I'm also very interested in these examples, could you please post them if
>> you find them?
>> thanks in advance,
>> Richard
>>
>>
>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> not that long ago there was a nice example on here about how to combine
>>> multiple operations on a single RDD. so basically if you want to do a
>>> count() and something else, how to roll them into a single job. i think
>>> patrick wendell gave the examples.
>>>
>>> i can't find them anymore... patrick can you please repost? thanks!
>>>
>>
>>
>
