[
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14085104#comment-14085104
]
Cheng Lian commented on SPARK-2650:
---
Some additional comments after more experiments and some improvements:
# How exactly the OOMs occur when caching a large table (assume N cores and M
memory are available within a single executor):
#- Say the table is so large that the underlying RDD is divided into X
partitions (usually X >> N; let's assume this here)
#- When caching the table, N tasks are executed in parallel, building column
buffers, each of which is memory consuming. Say each task consumes Y memory on
average.
#- At some point, the total memory consumption of all N parallel tasks,
namely N * Y, exceeds the available memory of the executor, and an OOM is thrown
#- All tasks fail and retry, but fail again, until the driver stops retrying
and the job fails
#- I guess the reason this issue hasn't been reported before is that, usually,
M > N * Y holds in production.
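The failure condition described above can be sketched with a few lines of Python (the numbers below are illustrative, not measured from a real executor): with N column-building tasks running in parallel, each consuming roughly Y memory, an OOM occurs once N * Y exceeds the executor's available memory M.

```python
# Sketch of the OOM condition described above; the figures are
# hypothetical, chosen only to illustrate the N * Y > M inequality.

def caching_ooms(executor_memory_mb, num_cores, per_task_buffer_mb):
    """True if building column buffers on all cores at once exceeds memory."""
    peak = num_cores * per_task_buffer_mb  # N * Y
    return peak > executor_memory_mb       # OOM when N * Y > M

# One core with a modest buffer fits in a 512 MB executor...
print(caching_ooms(512, 1, 100))   # False
# ...but 8 cores building 100 MB buffers in parallel do not.
print(caching_ooms(512, 8, 100))   # True
```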
# Initial buffer sizes do affect the OOMs in a subtle way:
#- Too large an initial buffer size implies, apparently, larger memory
consumption
#- Too small an initial buffer size causes the {{ColumnBuilder}} to keep
allocating larger buffers to ensure enough free space to hold more elements
(12.5% larger at a time). Thus up to 212.5% of the current buffer size is
required to finish growing a buffer (112.5% for the new buffer + 100% for the
original one).
#- A well estimated initial buffer size should be 1) large enough to avoid
buffer growth, and 2) small enough to avoid memory waste. For example, by hand
tuning, 5MB can be a good initial size for an executor with 512MB memory and 1
core.
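The 212.5% figure can be checked with a small sketch: growing a buffer by 12.5% allocates a new buffer of 1.125x the current size while the old one is still live during the copy, so the peak transient footprint of one growth step is 2.125x the current size.

```python
# Each growth step allocates a new buffer 12.5% larger than the current
# one; while the contents are copied over, both buffers are alive, so
# the peak footprint is old + new = 100% + 112.5% = 212.5%.

def growth_step_peak(current_size):
    new_size = current_size * 1.125  # 12.5% larger
    return current_size + new_size   # both buffers alive during the copy

print(growth_step_peak(1.0))  # 2.125, i.e. 212.5% of the current size
```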
# An obvious approach to help fix this issue is to reduce the memory
consumption of the column building process.
#- [PR #1769|https://github.com/apache/spark/pull/1769] is submitted to reduce
memory consumption of the column building process.
# Another approach is to estimate the initial buffer size. To do this, Shark
uses an estimated table partition size by leveraging the HDFS block size and
default column element sizes. We can use a similar approach in Spark SQL for
Hive tables, and a configurable initial size for non-Hive tables.
#- Currently {{InMemoryRelation}} resides in package
{{org.apache.spark.sql.columnar}} and doesn't know anything about Hive tables.
We can add an {{estimatedPartitionSize}} method, and override it in a new
{{InMemoryMetastoreRelation}} to estimate RDD partition sizes of a Hive table.
This will be done in another separate PR.
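A hypothetical sketch of the Shark-style estimate mentioned above (the helper name and per-type element sizes are illustrative, not the actual Spark SQL or Shark API): assume one RDD partition maps to one HDFS block, estimate the row count from the default per-column element sizes, then size each column buffer for that many elements.

```python
# Hypothetical estimate of initial column-buffer sizes, in the spirit of
# Shark's approach: divide the HDFS block size backing a partition by the
# estimated row width to guess the element count per partition.

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes

# Illustrative default element sizes (bytes) per column type.
DEFAULT_ELEM_SIZE = {"INT": 4, "LONG": 8, "DOUBLE": 8, "STRING": 32}

def estimated_buffer_sizes(column_types, block_size=HDFS_BLOCK_SIZE):
    """Return an estimated initial buffer size (bytes) per column index."""
    row_width = sum(DEFAULT_ELEM_SIZE[t] for t in column_types)
    est_rows = block_size // row_width  # rows expected in one partition
    return {i: est_rows * DEFAULT_ELEM_SIZE[t]
            for i, t in enumerate(column_types)}

sizes = estimated_buffer_sizes(["INT", "STRING", "DOUBLE"])
# The per-column estimates sum back to roughly one HDFS block.
print(sum(sizes.values()) <= HDFS_BLOCK_SIZE)  # True
```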
Wrong initial sizes for in-memory column buffers
Key: SPARK-2650
URL: https://issues.apache.org/jira/browse/SPARK-2650
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical
The logic for setting up the initial column buffers is different for Spark
SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger
than available memory (where Shark was okay).
Two suspicious things: the initialSize is always set to 0, so we always go with
the default. The default looks like it was copied from code like 10 * 1024 *
1024... but in Spark SQL it's 10 * 102 * 1024.