Michael Wu created SPARK-27628:
----------------------------------

             Summary: SortMergeJoin on a low-cardinality column results in 
heavy skew and large partitions
                 Key: SPARK-27628
                 URL: https://issues.apache.org/jira/browse/SPARK-27628
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.2
            Reporter: Michael Wu


Let's say we have a dataframe *a* that looks like this:
{code:java}
| temp | active |
|------|--------|
| 123  | Yes    |
| 1235 | No     |
...{code}
where the *active* column only contains two string values - "Yes" and "No". 

Let's say we do a join with some other dataframe *b* using *active* as the join 
key. Assume neither *a* nor *b* is small enough to allow for a broadcast join, 
so Spark is forced to do a sort-merge join.
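
To make the setup concrete, a rough reproduction along these lines should show the behavior (the row counts, column names, and the autoBroadcastJoinThreshold override here are illustrative, not from a real workload):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("low-cardinality-smj").getOrCreate()
// Rule out broadcast joins explicitly so the planner has to pick SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Dataframe a: many rows, but `active` only ever holds "Yes" or "No".
val a = spark.range(0L, 100000000L)
  .select(col("id").as("temp"),
          when(col("id") % 2 === 0, "Yes").otherwise("No").as("active"))

// Dataframe b: also too large to broadcast, sharing the same low-cardinality key.
val b = spark.range(0L, 100000000L)
  .select(col("id").as("other"),
          when(col("id") % 3 === 0, "Yes").otherwise("No").as("active"))

// The physical plan should show SortMergeJoin preceded by hashpartitioning(active, 200).
a.join(b, Seq("active")).explain()
{code}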

In a sort-merge join, Spark shuffle-partitions the dataframes on both sides of 
the join, using the hash of the join column value to determine the partition a 
given row belongs to. This appears to be catastrophic when the join column has 
low cardinality. In the case of dataframe *a*, it would be partitioned into X 
partitions by hashing the value in the *active* column modulo X (where X is the 
value of spark.sql.shuffle.partitions) - this means that only two partitions 
would have any data while the rest are empty. In cases where the dataframes 
involved in the join are large, this adds a lot of pressure on disk usage, not 
to mention reduced join performance due to pretty extreme skew. Is there any 
way around this behavior?
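
One way to see the skew directly (a sketch using the same illustrative dataframe *a* as above; spark_partition_id just labels the post-shuffle partitions):
{code:scala}
import org.apache.spark.sql.functions.{col, desc, spark_partition_id}

// Mimic the join's shuffle: hash-partition `a` on `active` into
// spark.sql.shuffle.partitions buckets, then count rows per partition.
val numShufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt  // 200 by default

a.repartition(numShufflePartitions, col("active"))
  .groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy(desc("count"))
  .show()
// Expected outcome: two partitions hold all the rows, the other 198 hold nothing.
{code}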

The current workaround I can think of is introducing additional columns as join 
keys to distribute data more evenly during the partitioning phase of the join 
(a sketch follows below). Is this the recommended approach?
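
For reference, the kind of workaround I have in mind is key salting, roughly as sketched below (the salt range and column names are made up; note it trades the skew for replicating one side saltBuckets times):
{code:scala}
import org.apache.spark.sql.functions._

val saltBuckets = 16

// Attach a random salt in [0, saltBuckets) to each row of a.
val aSalted = a.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate each row of b once per salt value so every (active, salt) pair still matches.
val bSalted = b.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// Joining on (active, salt) spreads each "Yes"/"No" group over saltBuckets partitions.
val salted = aSalted.join(bSalted, Seq("active", "salt")).drop("salt")
salted.explain()
{code}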


