[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xingbo Jiang reassigned SPARK-43043: ------------------------------------ Assignee: Xingbo Jiang > Improve the performance of MapOutputTracker.updateMapOutput > ----------------------------------------------------------- > > Key: SPARK-43043 > URL: https://issues.apache.org/jira/browse/SPARK-43043 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.3.2 > Reporter: Xingbo Jiang > Assignee: Xingbo Jiang > Priority: Major > > Inside of MapOutputTracker, there is a line of code which does a linear find > through a mapStatuses collection: > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 > (plus a similar search a few lines down at > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174) > This scan is necessary because we only know the mapId of the updated status > and not its mapPartitionId. > We perform this scan once per migrated block, so if a large proportion of all > blocks in the map are migrated then we get O(n^2) total runtime across all of > the calls. > I think we might be able to fix this by extending ShuffleStatus to have an > OpenHashMap mapping from mapId to mapPartitionId. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org