GitHub user bersprockets opened a pull request: https://github.com/apache/spark/pull/21950
[SPARK-24912][SQL][WIP] Add configuration to avoid OOM during broadcast join (and other negative side effects of incorrect table sizing)

## What changes were proposed in this pull request?

Added configuration settings to help avoid OOM errors during broadcast joins:

- Deser multiplication factor: tell Spark to multiply totalSize by a specified factor for tables backed by encoded files (i.e., Parquet or ORC files) when calculating a table's sizeInBytes. This is modelled after Hive's hive.stats.deserialization.factor configuration setting.
- Ignore rawDataSize: due to HIVE-20079, rawDataSize is broken. This setting tells Spark to ignore rawDataSize when calculating a table's sizeInBytes.

One can partially simulate the deser multiplication factor without this change by decreasing the value of spark.sql.autoBroadcastJoinThreshold. However, that affects all tables, not just the encoded ones.

There is some awkwardness in that the check for file type (Parquet or ORC) uses Hive deser names, but the checks for partitioned tables need to be made outside of the Hive submodule. Still working that out.

## How was this patch tested?

Added unit tests. Also verified that the deser multiplication factor avoids broadcast join OOM errors on both my laptop and a cluster, and that the ignore-rawDataSize flag avoids OOM errors on my laptop.
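The sizing logic these settings target can be sketched roughly as follows. This is a minimal, hypothetical illustration of the idea, not the actual Spark internals: the names `estimated_size_in_bytes`, `deser_factor`, `ignore_raw_data_size`, and `will_broadcast` are illustrative only, and the real PR wires these knobs into Spark's statistics code rather than standalone functions.

```python
def estimated_size_in_bytes(total_size, raw_data_size,
                            is_encoded_format,
                            deser_factor=1.0,
                            ignore_raw_data_size=False):
    """Rough sketch of table size estimation with the proposed knobs.

    total_size:        on-disk bytes from table stats (totalSize)
    raw_data_size:     deserialized bytes from table stats (rawDataSize),
                       unreliable due to HIVE-20079
    is_encoded_format: True for encoded files (Parquet, ORC)
    """
    # Prefer rawDataSize when present, unless told to ignore it.
    if not ignore_raw_data_size and raw_data_size > 0:
        return raw_data_size
    # For encoded files, inflate the on-disk size to approximate the
    # deserialized in-memory size, in the spirit of Hive's
    # hive.stats.deserialization.factor.
    if is_encoded_format:
        return int(total_size * deser_factor)
    return total_size


def will_broadcast(size_in_bytes, auto_broadcast_join_threshold):
    # Simplified: Spark broadcasts a relation whose estimated size is
    # at or below spark.sql.autoBroadcastJoinThreshold.
    return size_in_bytes <= auto_broadcast_join_threshold
```

For example, a Parquet table occupying 8 MB on disk but far larger once deserialized would be broadcast under the default 10 MB threshold; with a multiplication factor of 10 its estimate becomes 80 MB, and the broadcast (and potential OOM) is avoided without lowering the threshold for every table.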
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-24914

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21950.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21950

----

commit aa2a957751a906fe538822cace019014e763a8c3
Author: Bruce Robbins <bersprockets@...>
Date: 2018-07-26T00:36:17Z

    WIP version