GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/21950

    [SPARK-24912][SQL][WIP] Add configuration to avoid OOM during broadcast 
join (and other negative side effects of incorrect table sizing)

    ## What changes were proposed in this pull request?
    
    Added configuration settings to help avoid OOM errors during broadcast 
joins.
    
    - deser multiplication factor: Tell Spark to multiply totalSize by a 
specified factor for tables with encoded files (i.e., Parquet or ORC files). 
Spark will do this when calculating a table's sizeInBytes. This is modelled 
after Hive's hive.stats.deserialization.factor configuration setting.
    - ignore rawDataSize: Due to HIVE-20079, rawDataSize is broken. This 
setting tells Spark to ignore rawDataSize when calculating the table's 
sizeInBytes.
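    
    As a rough illustration of how the two settings above could interact, 
here is a minimal standalone sketch. It is not the code in this patch: the 
names `TableStats`, `estimate_size_in_bytes`, `deser_factor` and 
`ignore_raw_data_size` are hypothetical, and the rule that a positive 
rawDataSize wins unless it is ignored is an assumption made only for this 
example.

```python
from dataclasses import dataclass
from typing import Optional

# File formats treated as "encoded" in this sketch (assumption: parquet and orc only).
ENCODED_FORMATS = {"parquet", "orc"}


@dataclass
class TableStats:
    total_size: int                # on-disk size in bytes (Hive's totalSize)
    raw_data_size: Optional[int]   # Hive's rawDataSize; may be wrong (HIVE-20079)
    file_format: str               # e.g. "parquet", "orc", "text"


def estimate_size_in_bytes(stats, deser_factor=1.0, ignore_raw_data_size=False):
    """Hypothetical sizeInBytes estimate using both settings."""
    # Assumption for illustration: a positive rawDataSize is used unless ignored.
    if (not ignore_raw_data_size
            and stats.raw_data_size is not None
            and stats.raw_data_size > 0):
        return stats.raw_data_size
    # Scale the on-disk size only for encoded formats.
    if stats.file_format.lower() in ENCODED_FORMATS:
        return int(stats.total_size * deser_factor)
    return stats.total_size


# A parquet table whose rawDataSize is implausibly small.
stats = TableStats(total_size=100, raw_data_size=5, file_format="parquet")
trusting_raw = estimate_size_in_bytes(stats)
ignoring_raw = estimate_size_in_bytes(stats, deser_factor=3.0,
                                      ignore_raw_data_size=True)
```

    In this sketch a bad rawDataSize dominates the estimate until it is 
ignored, after which the factor scales totalSize for the encoded table only.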
    
    One can partially simulate the deser multiplication factor without this 
change by decreasing the value in spark.sql.autoBroadcastJoinThreshold. 
However, that will affect all tables, not just the ones that are encoded.
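    
    To make the difference concrete, here is a small sketch comparing the 
two approaches. The table names, sizes and the factor of 3 are made up for 
illustration, and `can_broadcast` and `scaled_size` are hypothetical helpers, 
not Spark functions. The 10 MB value is the default of 
spark.sql.autoBroadcastJoinThreshold.

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD = 10 * MB  # default spark.sql.autoBroadcastJoinThreshold


def can_broadcast(size_in_bytes, threshold=DEFAULT_THRESHOLD):
    # A table is a broadcast candidate when its estimated size fits the threshold.
    return 0 <= size_in_bytes <= threshold


def scaled_size(total_size, file_format, factor):
    # Apply the factor only to encoded formats (parquet, orc).
    if file_format in ("parquet", "orc"):
        return int(total_size * factor)
    return total_size


# Two hypothetical tables with the same on-disk size but different formats.
tables = {
    "events_parquet": (4 * MB, "parquet"),
    "lookup_text": (4 * MB, "text"),
}
factor = 3.0

# Approach 1: keep the threshold, scale only the encoded table's size.
with_factor = {
    name: can_broadcast(scaled_size(size, fmt, factor))
    for name, (size, fmt) in tables.items()
}

# Approach 2: simulate the factor by dividing the threshold for every table.
with_lower_threshold = {
    name: can_broadcast(size, int(DEFAULT_THRESHOLD / factor))
    for name, (size, fmt) in tables.items()
}
```

    With the factor, only the parquet table stops being a broadcast 
candidate. With the lowered threshold, the text table is excluded as well, 
even though its on-disk size is a reasonable estimate of its in-memory size.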
    
    There is some awkwardness in that the check for file type (parquet or orc) 
uses Hive deser names, but the checks for partitioned tables need to be made 
outside of the Hive submodule. Still working that out.
    
    ## How was this patch tested?
    
    Added unit tests.
    
    Also, checked that I can avoid broadcast join OOM errors when using the 
deser multiplication factor on both my laptop and a cluster. Also checked that 
I can avoid OOM errors using the ignore rawDataSize flag on my laptop.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-24914

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21950
    
----
commit aa2a957751a906fe538822cace019014e763a8c3
Author: Bruce Robbins <bersprockets@...>
Date:   2018-07-26T00:36:17Z

    WIP version

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org