Jurriaan Pruis created SPARK-16753:
--------------------------------------

             Summary: Spark SQL doesn't handle skewed dataset joins properly
                 Key: SPARK-16753
                 URL: https://issues.apache.org/jira/browse/SPARK-16753
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0, 1.6.2, 1.6.1, 2.0.1
            Reporter: Jurriaan Pruis


I'm having issues joining a 1 billion row dataframe containing skewed data with 
multiple dataframes ranging in size from 100,000 to 10 million rows. This means 
some of the joins (about half of them) can be done as broadcast joins, but not 
all of them.
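
For the dataframes that are small enough, a broadcast join avoids shuffling the large side entirely, so skew on the join key doesn't matter. A minimal sketch of that case, assuming Spark 2.0's SparkSession and hypothetical table/column names (`events`, `dim_small`, `key`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()

// Hypothetical tables: `events` is the ~1 billion row skewed dataframe,
// `dim_small` is one of the smaller dimension dataframes.
val large = spark.table("events")
val small = spark.table("dim_small")

// The broadcast hint ships the small side to every executor, so the
// large side is never shuffled and the skewed keys stay where they are.
val joined = large.join(broadcast(small), Seq("key"))
```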

Because the data in the large dataframe is skewed, all rows sharing a hot key 
hash into the same post-shuffle partition, and we get out of memory errors in 
the executors or errors like: 
`org.apache.spark.shuffle.FetchFailedException: Too large frame`.

We tried a lot of things, like broadcast joining the skewed rows separately and 
unioning them with the dataset containing the sort merge joined data (sketched 
below). This works perfectly when doing one or two joins, but when doing 10 
joins like this the query planner gets confused (see [SPARK-15326]).
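
A minimal sketch of that workaround, again with hypothetical names (`events`, `dim_small`, `key`) and assuming the hot key values are known up front:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val large = spark.table("events")
val dim   = spark.table("dim_small")

// Hot values identified beforehand (hypothetical).
val skewedKeys = Seq("hot1", "hot2")

// Split the large side into the skewed slice and the remainder.
val skewed    = large.filter($"key".isin(skewedKeys: _*))
val remainder = large.filter(!$"key".isin(skewedKeys: _*))

// Broadcast join only the skewed slice; the remainder goes through a
// normal sort merge join. Union the two results back together.
val result = skewed.join(broadcast(dim), Seq("key"))
  .union(remainder.join(dim, Seq("key")))
```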

As most of the rows are skewed on the NULL value, we use a hack where we put 
unique values in those NULL columns so the data is properly distributed over 
all partitions (see the sketch below). This works fine for the NULL values, but 
since this table is growing rapidly and we have skewed data for non-NULL values 
as well, this isn't a full solution to the problem.
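
A minimal sketch of the NULL-salting hack, assuming a string-typed join key named `key` (hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, concat, lit, monotonically_increasing_id}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val large = spark.table("events")  // hypothetical name

// Replace NULL keys with per-row unique values so those rows spread
// across all shuffle partitions instead of piling into one. The salted
// keys match nothing on the other side, which is the same result an
// equi-join on NULL would produce anyway.
val salted = large.withColumn("key",
  coalesce($"key", concat(lit("__null_"), monotonically_increasing_id().cast("string"))))
```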

Right now this specific Spark job succeeds only about 30% of the time, and it's 
getting worse because of the increasing amount of data.

How should these kinds of joins be approached using Spark? It seems strange 
that I can't find proper solutions to this problem, or other people running 
into the same kind of issues, given that Spark profiles itself as a large-scale 
data processing engine. Joining big datasets should be something Spark handles 
out of the box.


