Deenar Toraskar created SPARK-12358:
---------------------------------------

             Summary: Spark SQL query with lots of small tables under broadcast 
threshold leading to java.lang.OutOfMemoryError
                 Key: SPARK-12358
                 URL: https://issues.apache.org/jira/browse/SPARK-12358
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: Deenar Toraskar


Hi

I have a Spark SQL query joining a lot of small tables (5+), all below the 
broadcast threshold. Looking at the query plan, Spark broadcasts all of these 
tables together without checking whether sufficient memory is available. This 
leads to 

Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java 
heap space 

errors, causing the executors to die and the query to fail.

I worked around this issue by reducing spark.sql.autoBroadcastJoinThreshold 
so that the bigger tables in the query are no longer broadcast.
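For reference, the workaround looks like the following (the 1 MB value is only illustrative; pick a value small enough that the remaining broadcast tables fit in executor memory):

```sql
-- Lower the per-table broadcast threshold (default 10485760 bytes = 10 MB)
-- so the larger of the small tables fall back to shuffle joins instead:
SET spark.sql.autoBroadcastJoinThreshold=1048576;
```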

A fix would be to ensure that, in addition to the per-table threshold 
(spark.sql.autoBroadcastJoinThreshold), there is a cumulative broadcast 
threshold per query (say spark.sql.autoBroadcastJoinThresholdCumulative), so 
that only data up to that limit is broadcast, preventing executors from 
running out of memory.
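To illustrate the proposed fix, here is a hypothetical sketch (plain Python, not actual Spark planner code) of how the planner could account for a cumulative cap when deciding which join inputs to broadcast. Both threshold values and the function name are illustrative assumptions, not Spark APIs:

```python
# Hypothetical sketch of broadcast selection under two limits:
# a per-table threshold (spark.sql.autoBroadcastJoinThreshold) and a
# proposed cumulative per-query cap (autoBroadcastJoinThresholdCumulative).
PER_TABLE_THRESHOLD = 10 * 1024 * 1024    # 10 MB, the current default
CUMULATIVE_THRESHOLD = 32 * 1024 * 1024   # proposed per-query cap (illustrative)

def select_broadcast_tables(table_sizes):
    """Given {table_name: estimated_size_in_bytes}, return the tables
    that may be broadcast without exceeding either limit."""
    chosen, total = [], 0
    # Consider the smallest tables first so as many joins as possible
    # can remain broadcast joins before the cumulative cap is reached.
    for name, size in sorted(table_sizes.items(), key=lambda kv: kv[1]):
        if size <= PER_TABLE_THRESHOLD and total + size <= CUMULATIVE_THRESHOLD:
            chosen.append(name)
            total += size
    return chosen
```

Tables rejected here would fall back to shuffle joins, trading some join speed for predictable executor memory use.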




