[ https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan resolved SPARK-46512.
-----------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44512
[https://github.com/apache/spark/pull/44512]

> Optimize shuffle reading when both sort and combine are used.
> -------------------------------------------------------------
>
>                 Key: SPARK-46512
>                 URL: https://issues.apache.org/jira/browse/SPARK-46512
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> After the shuffle reader fetches the blocks, it first performs a combine
> operation and then a sort operation. Both combine and sort may spill
> temporary files, so performance can be poor when both are used. In fact,
> the combine can be performed during the sort, which avoids the combine
> spill files.
>
> I did not find a direct API for constructing a shuffle that uses both sort
> and combine, but the following code does it. It is a word count whose
> output words are sorted.
> {code:java}
> import org.apache.spark.rdd.ShuffledRDD
>
> // Assumes a spark-shell session where `sc` and `input` are already defined.
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1). // the shuffle gets an aggregator (combine)
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String). // and a key ordering (sort)
> collect().foreach(println)
> {code}
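To make the idea in the issue description concrete, here is a minimal in-memory sketch in plain Scala. It is not the code from pull request 44512 and does not use Spark. The object name `CombineDuringSortSketch` and both method names are hypothetical, chosen only for illustration. The first method mirrors the "combine, then sort" order described above, with two separate buffers. The second keeps a single key-ordered buffer and merges values as records arrive, so no separate combine buffer exists. In a real shuffle each buffer may spill to disk, which is where the saving would come from; this sketch does not model spilling.

```scala
import scala.collection.mutable

// Hypothetical illustration only; names are not taken from Spark.
object CombineDuringSortSketch {

  // Baseline: combine into a hash map first, then sort the combined entries.
  // Two buffers are materialized: the hash map and the sorted vector.
  def combineThenSort[K, V](records: Iterator[(K, V)])(merge: (V, V) => V)
                           (implicit ord: Ordering[K]): Vector[(K, V)] = {
    val combined = mutable.HashMap.empty[K, V]
    records.foreach { case (k, v) =>
      combined(k) = combined.get(k).fold(v)(merge(_, v))
    }
    combined.toVector.sortBy(_._1)
  }

  // Alternative: one key-ordered buffer; values are merged on insertion,
  // so combining happens while the sorted structure is being built.
  def combineDuringSort[K, V](records: Iterator[(K, V)])(merge: (V, V) => V)
                             (implicit ord: Ordering[K]): Vector[(K, V)] = {
    val sorted = mutable.TreeMap.empty[K, V]
    records.foreach { case (k, v) =>
      sorted(k) = sorted.get(k).fold(v)(merge(_, v))
    }
    sorted.toVector // TreeMap iterates in key order
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the word count in the issue: (word, 1) pairs summed per word.
    val pairs = "b a b c a b".split(" ").map(w => (w, 1))
    combineDuringSort(pairs.iterator)(_ + _).foreach(println)
  }
}
```

Both methods return the same result for the same input; they differ only in how many intermediate buffers are built along the way.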