Hi all,
     Can anyone share some tips on this issue?

周千昊 <qhz...@apache.org> wrote on Tuesday, September 8, 2015 at 4:46 PM:

> Hi, community
>      I have an application that I am trying to migrate from MR to Spark.
>      It does some calculations on data from Hive and writes the output to
> HFiles, which are then bulk loaded into an HBase table. Details as follows:
>
>      Rdd<Element> input = getSourceInputFromHive()
>      Rdd<Tuple2<byte[], byte[]>> mapSideResult =
>          input.glom().mapPartitions(/* some calculation */)
>      // PS: the result in each partition has already been sorted in
>      // lexicographical order during the calculation
>      mapSideResult.reduceByKey(/* some aggregations */)
>          .sortByKey(/**/)
>          .map(/* transform Tuple2<byte[], byte[]> to Tuple2<ImmutableBytesWritable, KeyValue> */)
>          .saveAsNewAPIHadoopFile(/* write to hfile */)
>
>       Here is the problem: in MR, the mapper output arrives at the reducer
> side already sorted, so the reduce phase is a merge sort and writing the
> HFiles is sequential and fast.
>       However, in Spark the output of the reduceByKey phase has been
> shuffled, so I have to sort the RDD again before writing the HFiles, which
> makes the job about 2x slower on Spark than on MR.
>       I am wondering whether there is anything I can leverage in Spark that
> has the same effect as in MR. I happened to see the JIRA ticket
> https://issues.apache.org/jira/browse/SPARK-2926. Is it related to what I
> am looking for?
>
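For reference, below is a minimal, self-contained sketch (in Scala, against the Spark and HBase APIs) of the pipeline described in the quoted message. The record type, the column family "cf" and qualifier "q", the aggregation, and the output path are all hypothetical placeholders, as is the stand-in for the Hive read; it only illustrates the flow, not the actual job.

    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object HiveToHFileSketch {

      // Hypothetical record type standing in for "Element" in the original pipeline.
      case class Element(rowKey: String, value: Long)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hive-to-hfile"))

        // Placeholder for the Hive read in the original message.
        val input: RDD[Element] = sc.parallelize(Seq.empty[Element])

        // Per-partition calculation; per the original message, each partition's
        // output is already sorted lexicographically by key at this point.
        val mapSideResult: RDD[(String, Long)] =
          input.glom().mapPartitions { arrays =>
            arrays.flatMap(_.map(e => (e.rowKey, e.value))) // stand-in calculation
          }

        // reduceByKey repartitions by hash, so the data has to be sorted again
        // (sortByKey) before HFiles can be written in total key order.
        val hfileReady: RDD[(ImmutableBytesWritable, KeyValue)] =
          mapSideResult
            .reduceByKey(_ + _) // stand-in aggregation
            .sortByKey()
            .map { case (k, v) =>
              val row = Bytes.toBytes(k)
              (new ImmutableBytesWritable(row),
               new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(v)))
            }

        // Write HFiles for a later bulk load; in a real job the Hadoop Configuration
        // would first be prepared via HFileOutputFormat2.configureIncrementalLoad.
        hfileReady.saveAsNewAPIHadoopFile(
          "/tmp/hfiles", // hypothetical output path
          classOf[ImmutableBytesWritable],
          classOf[KeyValue],
          classOf[HFileOutputFormat2])

        sc.stop()
      }
    }

The extra sortByKey pass after reduceByKey is exactly the step the quoted message identifies as the cost that MR avoids, since MR's shuffle delivers reducer input already merge-sorted.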
-- 
Best Regards
ZhouQianhao
