[ https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-17529. ------------------------------- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15084 [https://github.com/apache/spark/pull/15084] > On highly skewed data, outer join merges are slow > ------------------------------------------------- > > Key: SPARK-17529 > URL: https://issues.apache.org/jira/browse/SPARK-17529 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 1.6.2, 2.0.0 > Reporter: David C Navas > Fix For: 2.1.0 > > > All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the > same performance problem. > My co-worker Yewei Zhang was investigating a performance problem with a > highly skewed dataset. > "Part of this query performs a full outer join over [an ID] on highly skewed > data. On the left view, there is one record for id = 0 out of 2,272,486 > records; On the right view there are 8,353,097 records for id = 0 out of > 12,685,073 records" > The sub-query was taking 5.2 minutes. We discovered that snappy was > responsible for some measure of this problem and incorporated the new snappy > release. This brought the sub-query down to 2.4 minutes. A large percentage > of the remaining time was spent in the merge code which I tracked down to a > BitSet clearing issue. We have noted that you already have the snappy fix, > this issue describes the problem with the BitSet. > The BitSet grows to handle the largest matching set of keys and is used to > track joins. The BitSet is re-used in all subsequent joins (unless it is too > small) > The skewing of our data caused a very large BitSet to be allocated on the > very first row joined. Unfortunately, the entire BitSet is cleared on each > re-use. For each of the remaining rows which likely match only a few rows on > the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, > this would happen roughly 6 million times. The fix I developed for this is > to clear only the portion of the BitSet that is needed. After applying it, > the sub-query dropped from 2.4 minutes to 29 seconds. > Small (0 or negative) IDs are often used as place-holders for null, empty, or > unknown data, so I expect this fix to be generally useful, if rarely > encountered to this particular degree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org