So sorry about teasing you with the Scala. But the method is there in Java too, I just checked.
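In case it's useful, here is a rough sketch of the same zero/reduce/merge shape through the Java API (Spark 1.x, Java 8 lambdas). This is only an illustration, not code from the thread: the String/Integer types and the per-key sum are placeholders standing in for the K, V, and S in the Scala below.

import org.apache.spark.api.java.JavaPairRDD;

public class CombineByKeyExample {
    // Reduce an RDD of (String, Integer) pairs to one aggregate per key,
    // then return the result with the keys in sorted order.
    public static JavaPairRDD<String, Integer> reduceAndSort(JavaPairRDD<String, Integer> rdd) {
        JavaPairRDD<String, Integer> reducedUnsorted = rdd.combineByKey(
                v -> v,                 // zero: build the initial aggregate from the first value
                (agg, v) -> agg + v,    // reduce: fold one more value into the aggregate
                (a, b) -> a + b);       // merge: combine aggregates built on different partitions
        return reducedUnsorted.sortByKey();
    }
}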
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen <v...@paxata.com> wrote:

> It might not be the same as a real Hadoop reducer, but I think it would
> accomplish the same thing. Take a look at:
>
> import org.apache.spark.SparkContext._
> // val rdd: RDD[(K, V)]
> // def zero(value: V): S
> // def reduce(agg: S, value: V): S
> // def merge(agg1: S, agg2: S): S
> val reducedUnsorted: RDD[(K, S)] = rdd.combineByKey[S](zero, reduce, merge)
> reducedUnsorted.sortByKey()
>
> On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> I am struggling to reproduce the functionality of a Hadoop reducer in
>> Spark (in Java).
>>
>> In Hadoop I have a function
>>   public void doReduce(K key, Iterator<V> values)
>> and Hadoop also supplies a consumer (context.write), which can be seen as
>>   consume(key, value)
>>
>> In my code:
>> 1) knowing the key is important to the function
>> 2) there is neither one output tuple2 per key nor one output tuple2 per value
>> 3) the number of values per key might be large enough that storing them
>>    in memory is impractical
>> 4) keys must appear in sorted order
>>
>> One good example would run through a large document, using a similarity
>> function to compare against the last 200 lines and output any of those with
>> a similarity of more than 0.3 (do not suggest emitting everything and
>> filtering - the real problem is more complex). The critical concern is the
>> uncertain number of output tuples per key.
>>
>> My questions:
>> 1) How can this be done? Ideally the consumer would be a JavaPairRDD, but
>>    I don't see how to create one and add items to it later.
>>
>> 2) How do I handle the whole partition, sort, process (involving calls to
>>    doReduce) pipeline?
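For the two questions at the bottom of the thread, one pattern that gets close to a Hadoop-style reducer is to sortByKey first and then walk each partition with mapPartitionsToPair: after the sort, all pairs for a given key land in one partition and are contiguous, so a single pass can hand each key and its run of values to a doReduce-style callback that emits any number of output tuples. This is only a sketch against the Spark 1.x Java API (in 2.x the function would return an Iterator instead of an Iterable); the String/Integer types, the doReduce body, and the emit condition are made-up placeholders.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class ReducerStyleExample {

    public static JavaPairRDD<String, Integer> reduceLikeHadoop(JavaPairRDD<String, Integer> rdd) {
        return rdd
                .sortByKey()                      // keys now appear in sorted order
                .mapPartitionsToPair(partition -> {
                    List<Tuple2<String, Integer>> out = new ArrayList<>();
                    String currentKey = null;
                    List<Integer> currentValues = new ArrayList<>();
                    while (partition.hasNext()) {
                        Tuple2<String, Integer> pair = partition.next();
                        if (currentKey != null && !currentKey.equals(pair._1())) {
                            // key changed: hand the finished run of values to the reducer
                            doReduce(currentKey, currentValues.iterator(), out);
                            currentValues = new ArrayList<>();
                        }
                        currentKey = pair._1();
                        currentValues.add(pair._2());
                    }
                    if (currentKey != null) {
                        doReduce(currentKey, currentValues.iterator(), out);
                    }
                    // A real version would return a lazy iterator and stream each
                    // key's values instead of buffering them in lists, to cover the
                    // "values per key may not fit in memory" concern above.
                    return out;
                });
    }

    // Placeholder reducer: sees the key, an iterator over that key's values,
    // and a consumer it may write to zero or more times.
    private static void doReduce(String key, Iterator<Integer> values,
                                 List<Tuple2<String, Integer>> consumer) {
        while (values.hasNext()) {
            Integer v = values.next();
            if (v > 0) {                // arbitrary stand-in condition
                consumer.add(new Tuple2<>(key, v));
            }
        }
    }
}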