Hi, Shao & Pendey
After applying repartition-and-sort-within-partitions, the application
running on Spark is now faster than on MR. I will try to run it on a much
larger dataset as a benchmark.
Thanks again for the guidance.
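For reference, a minimal sketch of the repartition-and-sort-within-partitions
pattern mentioned above; the key/value types, partition count, and sample data
are illustrative assumptions:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object RepartitionSortSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-sort-sketch"))
        val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
        // A single shuffle that both repartitions the data and sorts each
        // partition by key, mirroring MR's map -> shuffle -> sorted reducer input.
        val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))
        sorted.mapPartitionsWithIndex { (idx, iter) =>
          iter.map { case (k, v) => s"partition $idx: $k -> $v" }
        }.collect().foreach(println)
        sc.stop()
      }
    }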
周千昊 wrote on Fri, Sep 11, 2015 at 1:35 PM:
Hi, Shao & Pendey
Thanks for the tips. I will try to work around this.
Saisai Shao wrote on Fri, Sep 11, 2015 at 1:23 PM:
Hi Qianhao,
I think you could sort the data yourself if you want to achieve the same
result as MR, like rdd.reduceByKey(...).mapPartitions(// sort within each
partition). Do not call sortByKey again, since it will introduce another
shuffle (that is the reason why it is slower than MR).
The problem: in MR jobs the output is sorted only within each reducer. That
can be better emulated by sorting each partition of the RDD rather than doing
a total sort of the whole RDD. In Rdd.mapPartitions you can sort the data
within one partition and try...
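For reference, a minimal sketch of what both replies describe: aggregate with
reduceByKey, then sort locally inside mapPartitions instead of calling
sortByKey. The element types here are illustrative assumptions:

    import org.apache.spark.rdd.RDD

    // One shuffle for the reduce, then a local sort per partition. Calling
    // sortByKey instead would add a second shuffle to impose a total order
    // that MR-style reducer output does not guarantee anyway.
    def reduceAndSortWithinPartitions(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
      rdd.reduceByKey(_ + _)
        .mapPartitions { iter =>
          // Materializes one partition in memory to sort it, matching what a
          // single MR reducer sees: keys ordered within its own partition only.
          iter.toArray.sortBy(_._1).iterator
        }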
On Sep 11, 2015 7:36 AM, "周千昊" wrote:
Hi, all
Can anyone give some tips about this issue?
周千昊 wrote on Tue, Sep 8, 2015 at 4:46 PM:
Hi, community
I have an application which I am trying to migrate from MR to Spark.
It does some calculations on data from Hive and outputs HFiles, which are
then bulk loaded into an HBase table. Details as follows:
Rdd<...> input = getSourceInputFromHive()
Rdd<...> mapSideResult =
    input.glom().mapPartitions(...)
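The quoted snippet is cut off above. As a rough sketch of the pipeline the
message describes (compute from Hive, shuffle with an in-partition sort, write
HFiles for bulk load): the column family "cf", qualifier "col", key/value
types, and partition count are all hypothetical, and a real job would use
HFileOutputFormat2.configureIncrementalLoad so that partitions line up with
the table's region boundaries:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Hypothetical sketch: rows keyed by HBase row key, values already computed.
    def writeHFiles(rows: RDD[(String, Long)], path: String, conf: Configuration): Unit = {
      rows
        // HFileOutputFormat2 expects its input ordered by row key, so sort
        // within partitions in the same shuffle that partitions the data.
        .repartitionAndSortWithinPartitions(new HashPartitioner(16))
        .map { case (k, v) =>
          val rowKey = Bytes.toBytes(k)
          (new ImmutableBytesWritable(rowKey),
           new KeyValue(rowKey, Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(v)))
        }
        .saveAsNewAPIHadoopFile(path,
          classOf[ImmutableBytesWritable], classOf[KeyValue],
          classOf[HFileOutputFormat2], conf)
    }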