Mark Hamstra created SPARK-18631:
------------------------------------

             Summary: Avoid making data skew worse in ExchangeCoordinator
                 Key: SPARK-18631
                 URL: https://issues.apache.org/jira/browse/SPARK-18631
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.2, 1.6.3, 2.1.0
            Reporter: Mark Hamstra
            Assignee: Mark Hamstra

The ExchangeCoordinator's logic for resizing partitions does not start a new partition until the targetPostShuffleInputSize has been equalled or exceeded. This can make data skew problems worse: a number of small partitions can first be combined as long as their combined size remains smaller than the targetPostShuffleInputSize, and then a large, data-skewed partition can be added to that same group, making it even bigger than it already was.

It is fairly simple to change the logic so that a new partition is created if adding the next piece would exceed the targetPostShuffleInputSize, instead of only creating a new partition after the targetPostShuffleInputSize has already been exceeded. This results in a few more partitions being created by the ExchangeCoordinator, but data skew problems are at least not made worse, even though they are not made any better.
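
The difference between the two rules can be illustrated with a small model. This is only a sketch in Python, not the ExchangeCoordinator's actual Scala code: the function names (coalesce_old, coalesce_new, partition_sizes), the list-of-sizes input, and the sample numbers are all assumptions made for illustration. Each function returns the start indices of the post-shuffle partitions.

```python
from typing import List


def coalesce_old(sizes: List[int], target: int) -> List[int]:
    """Model of the current rule: start a new partition only once the
    running size has already equalled or exceeded the target."""
    starts = [0]
    running = 0
    for i, size in enumerate(sizes):
        if i > 0 and running >= target:
            starts.append(i)   # previous group already reached the target
            running = size
        else:
            running += size    # keep merging, even if this piece is huge
    return starts


def coalesce_new(sizes: List[int], target: int) -> List[int]:
    """Model of the proposed rule: start a new partition if adding the
    next piece would push the running size past the target."""
    starts = [0]
    running = 0
    for i, size in enumerate(sizes):
        if i > 0 and running + size > target:
            starts.append(i)   # the next piece would overshoot, so split first
            running = size
        else:
            running += size
    return starts


def partition_sizes(sizes: List[int], starts: List[int]) -> List[int]:
    """Total size of each coalesced partition, given its start indices."""
    bounds = starts + [len(sizes)]
    return [sum(sizes[a:b]) for a, b in zip(bounds, bounds[1:])]


# Hypothetical input: three small partitions followed by one skewed partition.
sizes = [10, 10, 10, 100]
target = 64

old_sizes = partition_sizes(sizes, coalesce_old(sizes, target))
new_sizes = partition_sizes(sizes, coalesce_new(sizes, target))
print(old_sizes)  # the skewed piece is merged with the small ones
print(new_sizes)  # the skewed piece gets a partition of its own
```

With these made-up sizes, the current rule merges everything into a single partition of size 130, while the proposed rule yields partitions of size 30 and 100: one more partition, and the skewed one is no larger than it started.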