Hi community, I have an application that I am trying to migrate from MR to Spark. It does some calculations on data from Hive and writes the output to HFiles, which are then bulk loaded into an HBase table. Details as follows:
    Rdd<Element> input = getSourceInputFromHive()

    // PS: the result in each partition has already been sorted in
    // lexicographical order during the calculation
    Rdd<Tuple2<byte[], byte[]>> mapSideResult =
        input.glom().mapPartitions(/* some calculation */)

    mapSideResult
        .reduceByKey(/* some aggregations */)
        .sortByKey(/* ... */)
        .map(/* transform Tuple2<byte[], byte[]> to
                Tuple2<ImmutableBytesWritable, KeyValue> */)
        .saveAsNewAPIHadoopFile(/* write to hfile */)

Here is the problem: in MR, the mapper output arriving at the reducer side is already sorted, so the reducer only has to do a merge sort, which makes writing the HFiles sequential and fast. In Spark, however, the output of the reduceByKey phase has been shuffled, so I have to sort the whole RDD before writing the HFiles, and that makes the job about 2x slower on Spark than on MR.

I am wondering whether there is anything in Spark I can leverage to get the same effect as in MR. I happened to see the JIRA ticket https://issues.apache.org/jira/browse/SPARK-2926. Is it related to what I am looking for?
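To make the shape of the job concrete, below is a minimal compilable sketch of the current pipeline using the Java API and HFileOutputFormat2. The helpers loadAndCompute and aggregateValues are placeholders for my real Hive input and aggregation logic, and the column family / qualifier are made up. Note that the sketch keys on ImmutableBytesWritable rather than raw byte[], since byte[] does not hash by content and would break reduceByKey:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class HFileJobSketch {

      public static void run(JavaSparkContext sc, String outputPath) {
        // Placeholder for the Hive-backed source plus the per-partition
        // calculation; in the real job each partition's output is already
        // sorted lexicographically by key at this point.
        JavaPairRDD<ImmutableBytesWritable, byte[]> mapSideResult =
            loadAndCompute(sc);

        Configuration conf = HBaseConfiguration.create();

        mapSideResult
            // aggregate values that share a key ("some aggregations")
            .reduceByKey(HFileJobSketch::aggregateValues)
            // the extra total sort I would like to avoid: reduceByKey has
            // already paid for a full shuffle, but its output is unordered,
            // and HFileOutputFormat2 requires keys in ascending order
            .sortByKey()
            // transform to the (rowkey, KeyValue) pairs HFiles expect;
            // family "cf" and qualifier "q" are placeholders
            .mapToPair(t -> new Tuple2<>(
                t._1,
                new KeyValue(t._1.copyBytes(),
                             Bytes.toBytes("cf"),
                             Bytes.toBytes("q"),
                             t._2)))
            .saveAsNewAPIHadoopFile(
                outputPath,
                ImmutableBytesWritable.class,
                KeyValue.class,
                HFileOutputFormat2.class,
                conf);
      }

      // Placeholders for the real logic.
      private static JavaPairRDD<ImmutableBytesWritable, byte[]> loadAndCompute(
          JavaSparkContext sc) {
        throw new UnsupportedOperationException("hive input + calculation");
      }

      private static byte[] aggregateValues(byte[] a, byte[] b) {
        throw new UnsupportedOperationException("some aggregations");
      }
    }

In the real job the Configuration would also carry the settings that HFileOutputFormat2.configureIncrementalLoad normally produces (table name, per-family compression, and so on); I have left those out here for brevity.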