Hi all, can anyone give some tips about this issue?

周千昊 <qhz...@apache.org> wrote on Tue, Sep 8, 2015 at 4:46 PM:
> Hi, community
>     I have an application which I am trying to migrate from MR to Spark.
>     It does some calculations on data from Hive and outputs an hfile, which
> is then bulk loaded into an HBase table. Details as follows:
>
>     Rdd<Element> input = getSourceInputFromHive()
>     Rdd<Tuple2<byte[], byte[]>> mapSideResult =
>         input.glom().mapPartitions(/*some calculation*/)
>     // PS: the result in each partition has already been sorted in
>     // lexicographical order during the calculation
>     mapSideResult.reduceByKey(/*some aggregations*/)
>         .sortByKey(/**/)
>         .map(/*transform Tuple2<byte[], byte[]> to
>               Tuple2<ImmutableBytesWritable, KeyValue>*/)
>         .saveAsNewAPIHadoopFile(/*write to hfile*/)
>
>     Here is the problem: in MR, the mapper output arrives at the reducer
> side already sorted, so the reducer performs a merge sort, which makes
> writing to the hfile sequential and fast.
>     However, in Spark, the output of the reduceByKey phase has been
> shuffled, so I have to sort the rdd before writing the hfile, which makes
> it run about 2x slower on Spark than on MR.
>     I am wondering whether there is anything I can leverage that has the
> same effect as in MR. I happened to see a JIRA ticket,
> https://issues.apache.org/jira/browse/SPARK-2926. Is it related to what I
> am looking for?
> --
> Best Regards
> ZhouQianhao
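One avenue worth checking, as a minimal sketch rather than a confirmed fix for this thread: since Spark 1.2, JavaPairRDD.repartitionAndSortWithinPartitions performs a single shuffle and sorts each output partition during that shuffle, which is closer to the MR reduce-side merge sort than reduceByKey followed by a separate sortByKey. The Java sketch below assumes the aggregated RDD has already been mapped to (ImmutableBytesWritable, KeyValue) pairs; the names `aggregated`, `numPartitions`, `outputPath`, and `conf` are hypothetical placeholders, not from the original post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class HFileSortSketch {
    // Sketch only: `aggregated`, `numPartitions`, `outputPath`, and `conf`
    // are assumed inputs, not names from the original application.
    static void writeSortedHFiles(
            JavaPairRDD<ImmutableBytesWritable, KeyValue> aggregated,
            int numPartitions, String outputPath, Configuration conf) {
        // One shuffle that also sorts the records of each output partition
        // as part of the shuffle (a reduce-side merge of sorted runs),
        // instead of reduceByKey followed by a separate sortByKey pass.
        // Uses the natural ordering of ImmutableBytesWritable, which
        // implements Comparable.
        JavaPairRDD<ImmutableBytesWritable, KeyValue> sorted =
            aggregated.repartitionAndSortWithinPartitions(
                new HashPartitioner(numPartitions));

        // HFileOutputFormat2 requires keys in sorted order within each
        // output file. Note that for a real bulk load the partitioner
        // should align with the HBase region boundaries (total ordering
        // across partitions), not a HashPartitioner as shown here.
        sorted.saveAsNewAPIHadoopFile(
            outputPath,
            ImmutableBytesWritable.class,
            KeyValue.class,
            HFileOutputFormat2.class,
            conf);
    }
}

As for the JIRA ticket: SPARK-2926 proposes an MR-style merge-sort reader on the reduce side of the sort-based shuffle, so it does target the same inefficiency; until something like it lands, repartitionAndSortWithinPartitions is the closest building block the public API offers.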