Hey Bipin,

Thanks for the reply. I am actually aggregating after the groupByKey() operation; I posted the wrong code snippet in my first email. Here is what I am doing:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x)).groupByKey(25).map(build_edges)

Can we replace this with reduceByKey() in this context?

Santhosh

On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas <revolutioni...@gmail.com> wrote:

> Hi Santhosh,
>
> If you are not performing any aggregation, then I don't think you can
> replace your groupByKey with a reduceByKey. As I see it, you are only
> grouping and taking 2 values of the result, so I believe you can't just
> replace your groupByKey with that.
>
> Thanks & Regards
> Biplob Biswas
>
>
> On Sat, Aug 4, 2018 at 12:05 AM Bathi CCDB <bathi.c...@gmail.com> wrote:
>
>> I am trying to replace groupByKey() with reduceByKey(). I am a pyspark
>> and python newbie and I am having a hard time figuring out the lambda
>> function for the reduceByKey() operation.
>>
>> Here is the code:
>>
>> dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x)).groupByKey(25).take(2)
>>
>> Here is the return value:
>>
>> >>> dd
>> [(u'KEY_1', <pyspark.resultiterable.ResultIterable object at 0x107be0c50>),
>>  (u'KEY_2', <pyspark.resultiterable.ResultIterable object at 0x107be0c10>)]
>>
>> and here are the iterable contents of dd[0][1]:
>>
>> Row(key=u'KEY_1', hash_fn=u'deec95d65ca6b3b4f2e1ef259040aa79', value=u'e7dc1f2a')
>> Row(key=u'KEY_1', hash_fn=u'f8891048a9ef8331227b4af080ecd28a', value=u'fb0bc953')
>> .......
>> Row(key=u'KEY_1', hash_fn=u'1b9d2bb2db28603ff21052efcd13f242', value=u'd39714d3')
>> Row(key=u'KEY_1', hash_fn=u'c41b0269706ac423732a6bab24bf8a6a', value=u'ab58db92')
>>
>> My question is how do I replace it with reduceByKey() and get the same
>> output as above?
>>
>> Santhosh
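[Editor's note] For reference, the standard way to mimic groupByKey with reduceByKey is to wrap each value in a list first, since reduceByKey needs an associative binary merge: in RDD terms, something like `rdd.map(lambda x: (x[0], [x])).reduceByKey(lambda a, b: a + b)`. The plain-Python sketch below (using made-up sample pairs, not the thread's actual data) simulates both operations locally to show why the wrap-then-concatenate lambda reproduces the grouped output. Note this still collects every value per key, so it does not give the usual performance benefit of reduceByKey; that benefit only appears when the merge actually shrinks the data (e.g. sums or counts).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (key, value) pairs standing in for the Row objects in the thread.
rows = [("KEY_1", "a"), ("KEY_1", "b"), ("KEY_2", "c")]

# groupByKey-equivalent: one (key, list-of-values) entry per key.
grouped = {
    k: [v for _, v in g]
    for k, g in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0))
}

# reduceByKey-equivalent: wrap each value in a list (mapValues(lambda v: [v])),
# then merge pairwise with an associative function (lambda a, b: a + b).
merged = {}
for k, v in rows:
    merged[k] = merged.get(k, []) + [v]

assert grouped == merged  # both approaches yield the same per-key groups
print(merged)
```

The key point is that the reduceByKey lambda must take two partial results of the same type and combine them, which is why the values are lifted into single-element lists before merging.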