Re: Why repartitionAndSortWithinPartitions slower than MapReduce
I assume you are using RDDs? What are you doing after the repartitioning + sorting, if anything?

On Aug 20, 2018 11:22, "周浥尘" wrote:

> In addition to my previous email, Environment: Spark 2.1.2, Hadoop
> 2.6.0-cdh5.11, Java 1.8, CentOS 6.6
>
> 周浥尘 wrote on Mon, Aug 20, 2018 at 8:52 PM:
>
>> Hi team,
>>
>> I found that the Spark method *repartitionAndSortWithinPartitions*
>> takes twice as long as MapReduce in some cases.
>> I want to repartition the dataset according to split keys and save the
>> partitions to files in ascending key order. As the doc says,
>> repartitionAndSortWithinPartitions "is more efficient than calling
>> `repartition` and then sorting within each partition because it can
>> push the sorting down into the shuffle machinery."
>> I thought it would be faster than MR, but in fact it is much slower. I
>> also adjusted several Spark configurations, but that didn't help. (Both
>> Spark and MapReduce run on a three-node cluster and use the same number
>> of partitions.)
>> Can this behavior be explained, or is there any way to improve Spark's
>> performance here?
>>
>> Thanks & Regards,
>> Yichen
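For reference, a minimal sketch of the kind of RDD job described above. The input path, output path, tab-separated record layout, and the partition count of 64 are all assumptions, not details from the original question:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-and-sort"))

    // Assumed input: one record per line, with the split key in the
    // first tab-separated field.
    val pairs = sc.textFile("hdfs:///input")            // placeholder path
      .map(line => (line.split('\t')(0), line))

    // repartitionAndSortWithinPartitions shuffles by the partitioner and
    // sorts each output partition by key during the shuffle itself, so
    // no separate sort stage runs afterwards.
    pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(64)) // 64 is a placeholder
      .values                                           // drop the key, keep the record
      .saveAsTextFile("hdfs:///output")                 // placeholder path

    sc.stop()
  }
}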
Re: Why repartitionAndSortWithinPartitions slower than MapReduce
In addition to my previous email, Environment: Spark 2.1.2, Hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6
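One thing worth checking, given the reply's question about RDDs: an RDD shuffle serializes Java objects and sorts deserialized records, whereas the same job expressed on the Dataset API can use Tungsten's binary row format and generated code, which may close this kind of gap. A sketch of that variant, under the same assumptions (placeholder paths, tab-separated layout, partition count of 64):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DatasetSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-repartition-sort").getOrCreate()
    import spark.implicits._

    // Assumed input: one record per line, split key in the first
    // tab-separated field.
    val df = spark.read.textFile("hdfs:///input")       // placeholder path
      .map(line => (line.split('\t')(0), line))
      .toDF("key", "record")

    // Same shape as repartitionAndSortWithinPartitions: hash-partition by
    // key, then sort within each partition, here on the Dataset API.
    df.repartition(64, col("key"))                      // 64 is a placeholder
      .sortWithinPartitions("key")
      .select("record")
      .write.text("hdfs:///output")                     // placeholder path

    spark.stop()
  }
}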