Hi community,
     I have an application that I am trying to migrate from MR to Spark.
     It does some calculations on Hive data and writes the output to HFiles,
which are then bulk loaded into an HBase table. Details as follows:

     JavaRDD<Element> input = getSourceInputFromHive();
     JavaPairRDD<byte[], byte[]> mapSideResult =
         input.glom().mapPartitionsToPair(/* some calculation */);
     // PS: the result in each partition is already sorted in
     // lexicographical order during the calculation
     mapSideResult
         .reduceByKey(/* some aggregations */)
         .sortByKey(/* byte[] comparator */)
         .mapToPair(/* transform Tuple2<byte[], byte[]> to
                       Tuple2<ImmutableBytesWritable, KeyValue> */)
         .saveAsNewAPIHadoopFile(/* write to HFile */);
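
      For completeness, the last write step expands to roughly the
following in my understanding (a sketch against the HBase 1.x API;
"my_table", "/tmp/hfiles" and hfileRdd are placeholders):

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.KeyValue;
     import org.apache.hadoop.hbase.TableName;
     import org.apache.hadoop.hbase.client.Connection;
     import org.apache.hadoop.hbase.client.ConnectionFactory;
     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
     import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
     import org.apache.hadoop.mapreduce.Job;

     Configuration conf = HBaseConfiguration.create();
     Job job = Job.getInstance(conf);
     try (Connection conn = ConnectionFactory.createConnection(conf)) {
         TableName name = TableName.valueOf("my_table");  // placeholder
         // Copies the table's compression/block-size/bloom settings into
         // the job config. The reducer and TotalOrderPartitioner it sets
         // up only apply to MR; in Spark the RDD itself must already be
         // key-sorted, hence the sortByKey above.
         HFileOutputFormat2.configureIncrementalLoad(
             job, conn.getTable(name), conn.getRegionLocator(name));
     }
     // hfileRdd: the JavaPairRDD<ImmutableBytesWritable, KeyValue>
     // produced by the mapToPair step above
     hfileRdd.saveAsNewAPIHadoopFile(
         "/tmp/hfiles",                  // placeholder output path
         ImmutableBytesWritable.class,
         KeyValue.class,
         HFileOutputFormat2.class,
         job.getConfiguration());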

      Here is the problem: in MR, the map output arrives at the reducer
already sorted, so the reduce side only performs a merge sort, which makes
writing the HFiles sequential and fast.
      In Spark, however, the output of the reduceByKey phase comes back
shuffled, so I have to sort the whole RDD before writing the HFiles, which
makes the job about 2x slower on Spark than on MR.
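
      What I think I want is for the shuffle itself to deliver key-sorted
partitions, roughly like the sketch below. It uses
repartitionAndSortWithinPartitions, which seems to be available since
Spark 1.2; numPartitions is a placeholder, and BytesPartitioner /
RowKeyComparator are helpers I made up for illustration:

     import java.io.Serializable;
     import java.util.Arrays;
     import java.util.Comparator;
     import org.apache.hadoop.hbase.util.Bytes;
     import org.apache.spark.Partitioner;
     import org.apache.spark.api.java.JavaPairRDD;

     // Partition by array contents; the stock HashPartitioner would use
     // byte[]'s identity hashCode and scatter equal keys across partitions.
     class BytesPartitioner extends Partitioner {
         private final int partitions;
         BytesPartitioner(int partitions) { this.partitions = partitions; }
         @Override public int numPartitions() { return partitions; }
         @Override public int getPartition(Object key) {
             return (Arrays.hashCode((byte[]) key) & Integer.MAX_VALUE)
                 % partitions;
         }
     }

     // Unsigned lexicographic order, the same order HFiles expect.
     class RowKeyComparator implements Comparator<byte[]>, Serializable {
         @Override public int compare(byte[] a, byte[] b) {
             return Bytes.compareTo(a, b);
         }
     }

     // The shuffle sorts each partition by key, so equal keys arrive
     // adjacent and the aggregation becomes one streaming pass; no
     // separate sortByKey stage afterwards.
     JavaPairRDD<byte[], byte[]> sorted = mapSideResult
         .repartitionAndSortWithinPartitions(
             new BytesPartitioner(numPartitions),  // numPartitions: placeholder
             new RowKeyComparator());
     JavaPairRDD<byte[], byte[]> aggregated =
         sorted.mapPartitionsToPair(/* merge adjacent runs of equal keys */);

      (For the bulk load itself the partitioner would presumably also have
to line up with the HBase region boundaries, the way TotalOrderPartitioner
does in MR, but that is a separate issue.)
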
      I am wondering whether there is anything I can leverage that has the
same effect as in MR; is the sketch above the right direction? I also
happened to see the JIRA ticket
https://issues.apache.org/jira/browse/SPARK-2926. Is it related to what I
am looking for?
