Hey Bipin,

Thanks for the reply. I am actually aggregating after the groupByKey() operation; I posted the wrong code snippet in my first email. Here is what I am doing:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x)).groupByKey(25).map(build_edges)

Can we replace this with reduceByKey() in this context?

Santhosh

On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas <revolutioni...@gmail.com> wrote:

> Hi Santhosh,
>
> If you are not performing any aggregation, then I don't think you can
> replace your groupByKey with a reduceByKey. As I see it, you are only
> grouping and taking 2 values of the result, so I believe you can't just
> replace your groupByKey with that.
>
> Thanks & Regards
> Biplob Biswas
>
>
> On Sat, Aug 4, 2018 at 12:05 AM Bathi CCDB <bathi.c...@gmail.com> wrote:
>
>> I am trying to replace groupByKey() with reduceByKey(). I am a pyspark
>> and python newbie and I am having a hard time figuring out the lambda
>> function for the reduceByKey() operation.
>>
>> Here is the code:
>>
>> dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x)).groupByKey(25).take(2)
>>
>> Here is the return value:
>>
>> >>> dd
>> [(u'KEY_1', <pyspark.resultiterable.ResultIterable object at 0x107be0c50>),
>>  (u'KEY_2', <pyspark.resultiterable.ResultIterable object at 0x107be0c10>)]
>>
>> and here are the iterable contents of dd[0][1]:
>>
>> Row(key=u'KEY_1', hash_fn=u'deec95d65ca6b3b4f2e1ef259040aa79', value=u'e7dc1f2a')
>> Row(key=u'KEY_1', hash_fn=u'f8891048a9ef8331227b4af080ecd28a', value=u'fb0bc953')
>> .......
>> Row(key=u'KEY_1', hash_fn=u'1b9d2bb2db28603ff21052efcd13f242', value=u'd39714d3')
>> Row(key=u'KEY_1', hash_fn=u'c41b0269706ac423732a6bab24bf8a6a', value=u'ab58db92')
>>
>> My question is how do I replace it with reduceByKey() and get the same
>> output as above?
>>
>> Santhosh
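[Editor's note] For reference, the standard way to mimic groupByKey with reduceByKey is to wrap each value in a list first, since reduceByKey needs an associative binary merge: in RDD terms, something like `rdd.map(lambda x: (x[0], [x])).reduceByKey(lambda a, b: a + b)`. The plain-Python sketch below (using made-up sample pairs, not the thread's actual data) simulates both operations locally to show why the wrap-then-concatenate lambda reproduces the grouped output. Note this still collects every value per key, so it does not give the usual performance benefit of reduceByKey; that benefit only appears when the merge actually shrinks the data (e.g. sums or counts).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (key, value) pairs standing in for the Row objects in the thread.
rows = [("KEY_1", "a"), ("KEY_1", "b"), ("KEY_2", "c")]

# groupByKey-equivalent: one (key, list-of-values) entry per key.
grouped = {
    k: [v for _, v in g]
    for k, g in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0))
}

# reduceByKey-equivalent: wrap each value in a list (mapValues(lambda v: [v])),
# then merge pairwise with an associative function (lambda a, b: a + b).
merged = {}
for k, v in rows:
    merged[k] = merged.get(k, []) + [v]

assert grouped == merged  # both approaches yield the same per-key groups
print(merged)
```

The key point is that the reduceByKey lambda must take two partial results of the same type and combine them, which is why the values are lifted into single-element lists before merging.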