So sorry about teasing you with the Scala. But the method is there in Java too, I just checked.
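In case it's useful, here is a rough sketch of the same zero/reduce/merge shape through the Java API (Spark 1.x, Java 8 lambdas). This is only an illustration, not code from the thread: the String/Integer types and the per-key sum are placeholders standing in for the K, V, and S in the Scala below.

import org.apache.spark.api.java.JavaPairRDD;

public class CombineByKeyExample {
    // Reduce an RDD of (String, Integer) pairs to one aggregate per key,
    // then return the result with the keys in sorted order.
    public static JavaPairRDD<String, Integer> reduceAndSort(JavaPairRDD<String, Integer> rdd) {
        JavaPairRDD<String, Integer> reducedUnsorted = rdd.combineByKey(
                v -> v,                 // zero: build the initial aggregate from the first value
                (agg, v) -> agg + v,    // reduce: fold one more value into the aggregate
                (a, b) -> a + b);       // merge: combine aggregates built on different partitions
        return reducedUnsorted.sortByKey();
    }
}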
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen <v...@paxata.com> wrote:

> It might not be the same as a real Hadoop reducer, but I think it would
> accomplish the same thing. Take a look at:
>
> import org.apache.spark.SparkContext._
> // val rdd: RDD[(K, V)]
> // def zero(value: V): S
> // def reduce(agg: S, value: V): S
> // def merge(agg1: S, agg2: S): S
> val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey[S](zero, reduce, merge)
> reducedUnsorted.sortByKey()
>
> On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> I am struggling to reproduce the functionality of a Hadoop reducer in
>> Spark (in Java).
>>
>> In Hadoop I have a function
>>   public void doReduce(K key, Iterator<V> values)
>> and Hadoop also supplies a consumer (context.write), which can be seen as
>>   consume(key, value)
>>
>> In my code:
>> 1) knowing the key is important to the function
>> 2) there is neither one output tuple2 per key nor one output tuple2 per value
>> 3) the number of values per key might be large enough that storing them
>>    in memory is impractical
>> 4) keys must appear in sorted order
>>
>> One good example would run through a large document, using a similarity
>> function to compare against the last 200 lines and output any of those with
>> a similarity of more than 0.3 (do not suggest emitting everything and
>> filtering - the real problem is more complex). The critical concern is the
>> uncertain number of output tuples per key.
>>
>> My questions:
>> 1) How can this be done? Ideally the consumer would be a JavaPairRDD, but
>>    I don't see how to create one and add items to it later.
>>
>> 2) How do I handle the whole partition, sort, process (involving calls to
>>    doReduce) pipeline?
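For the two questions at the bottom of the thread, one pattern that gets close to a Hadoop-style reducer is to sortByKey first and then walk each partition with mapPartitionsToPair: after the sort, all pairs for a given key land in one partition and are contiguous, so a single pass can hand each key and its run of values to a doReduce-style callback that emits any number of output tuples. This is only a sketch against the Spark 1.x Java API (in 2.x the function would return an Iterator instead of an Iterable); the String/Integer types, the doReduce body, and the emit condition are made-up placeholders.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class ReducerStyleExample {

    public static JavaPairRDD<String, Integer> reduceLikeHadoop(JavaPairRDD<String, Integer> rdd) {
        return rdd
                .sortByKey()                      // keys now appear in sorted order
                .mapPartitionsToPair(partition -> {
                    List<Tuple2<String, Integer>> out = new ArrayList<>();
                    String currentKey = null;
                    List<Integer> currentValues = new ArrayList<>();
                    while (partition.hasNext()) {
                        Tuple2<String, Integer> pair = partition.next();
                        if (currentKey != null && !currentKey.equals(pair._1())) {
                            // key changed: hand the finished run of values to the reducer
                            doReduce(currentKey, currentValues.iterator(), out);
                            currentValues = new ArrayList<>();
                        }
                        currentKey = pair._1();
                        currentValues.add(pair._2());
                    }
                    if (currentKey != null) {
                        doReduce(currentKey, currentValues.iterator(), out);
                    }
                    // A real version would return a lazy iterator and stream each
                    // key's values instead of buffering them in lists, to cover the
                    // "values per key may not fit in memory" concern above.
                    return out;
                });
    }

    // Placeholder reducer: sees the key, an iterator over that key's values,
    // and a consumer it may write to zero or more times.
    private static void doReduce(String key, Iterator<Integer> values,
                                 List<Tuple2<String, Integer>> consumer) {
        while (values.hasNext()) {
            Integer v = values.next();
            if (v > 0) {                // arbitrary stand-in condition
                consumer.add(new Tuple2<>(key, v));
            }
        }
    }
}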