Hi, I'm regularly hitting "Unable to acquire memory" problems, but only when
overflow pages come into play and only when running the full set of Spark
tests; this happens across different platforms. The machines I'm using all
have well over 10 GB of RAM, I'm running without any changes to the pom.xml
file, and the standard 3 GB Java heap is specified.

I'm working off this revision:

commit 43e0135421b2262cbb0e06aae53523f663b4f959
Author: Yin Huai <yh...@databricks.com>
Date:   Thu Aug 20 15:30:31 2015 +0800

    [SPARK-10092] [SQL] Multi-DB support follow up.

    https://issues.apache.org/jira/browse/SPARK-10092

    This pr is a follow-up one for Multi-DB support. It has the following
    changes:

    * `HiveContext.refreshTable` now accepts `dbName.tableName`.

I've added prints in a variety of places. When we run just the one suite we
don't hit the problem, but with the whole batch of tests we do.
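
For reference, the prints are along these lines (a rough paraphrase of my
instrumentation, not upstream code; pageSizeBytes and shuffleMemoryManager
are the existing fields in UnsafeExternalSorter.java, and tryToAcquire is
the call that the quoted log lines wrap):

    // Paraphrased instrumentation around page acquisition in
    // UnsafeExternalSorter; the strings match the log lines quoted below.
    System.err.println("Creating unsafe external sorter, pageSizeBytes: " + pageSizeBytes);
    // ... later, when a new page is needed (acquireNewPage):
    System.err.println("acquiring " + pageSizeBytes + " from shuffle memory manager");
    final long memoryAcquired = shuffleMemoryManager.tryToAcquire(pageSizeBytes);
    System.err.println("memoryAcquired is: " + memoryAcquired);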

Example below; note that the failure is always in the join31 test.

cat CheckHashJoinFullBatch.txt | grep -C 10 "join31"
- auto_join30
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
- auto_join31
- auto_join32
- auto_join4
- auto_join5
- auto_join6
- auto_join7
- auto_join8
- auto_join9
04:53:44.685 WARN
org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite:
Simplifications made on unsupported operations for test auto_join_filters
- auto_join_filters
- auto_join_nulls
--
05:08:18.329 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 43.0
in stage 2993.0 (TID 130982, localhost): TaskKilled (killed intentionally)
05:08:18.330 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 40.0
in stage 2993.0 (TID 130979, localhost): TaskKilled (killed intentionally)
05:08:18.340 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 15.0
in stage 2993.0 (TID 130954, localhost): TaskKilled (killed intentionally)
05:08:18.341 ERROR org.apache.spark.executor.Executor: Managed memory leak
detected; size = 12582912 bytes, TID = 130985
05:08:18.341 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 46.0
in stage 2993.0 (TID 130985, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 41.0
in stage 2993.0 (TID 130980, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 26.0
in stage 2993.0 (TID 130965, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0
in stage 2993.0 (TID 130943, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 11.0
in stage 2993.0 (TID 130950, localhost): TaskKilled (killed intentionally)
05:08:18.349 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 28.0
in stage 2993.0 (TID 130967, localhost): TaskKilled (killed intentionally)
- join31 *** FAILED ***
  Failed to execute query using catalyst:
  Error: Job aborted due to stage failure: Task 42 in stage 2993.0 failed 1
times, most recent failure: Lost task 42.0 in stage 2993.0 (TID 130981,
localhost): java.io.IOException: Unable to acquire 4194304 bytes of memory
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:371)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:350)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:489)
        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:477)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:610)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)

I run the test on its own with:
mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver \
  -DwildcardSuites=org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite \
  -fn test > CheckHashJoin.txt 2>&1

I run the whole batch with:
mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -fn test \
  > CheckHashJoinFullBatch.txt 2>&1

java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el7_0-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Note that the problem also occurs when plenty of memory is reported as free
(10 GB+):
free -m
             total       used       free     shared    buffers     cached
Mem:         11855      11389        466        668          0       3305
-/+ buffers/cache:       8084       3771
Swap:         6023         83       5940

Potentially useful debug info when it passes:

Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
- join31

When it fails, my printout shows that the if (useOverflowPage) branch is taken.
The output features:

creating with existing in memory sorter, pageSizeBytes: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
determined total space required is: 24
*decided to use overflow page*
*Required space (24) is less than free space in current page (0)*
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 1433178
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
07:41:01.442 ERROR org.apache.spark.executor.Executor: Managed memory leak
detected; size = 8388608 bytes, TID = 230633
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
creating with existing in memory sorter, pageSizeBytes: 4194304
07:41:01.442 ERROR org.apache.spark.executor.Executor: Exception in task 4.0
in stage 6400.0 (TID 230633)
java.io.IOException: Unable to acquire 4194304 bytes of memory

Note that I was originally hitting the "Unable to acquire memory" problems with
the default pageSize, and that was addressed by the helpful post here:
<https://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCA+LY3qkm2fH_ioMN6a-f+YvFEhavskZR73wbKZaZ=wvf9+o...@mail.gmail.com%3E>

Perhaps I need to set another option or change the value? 

My top-level pom.xml sets <spark.buffer.pageSize>4m</spark.buffer.pageSize>
for both the Java and Scala tests.
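
If it helps to experiment with the value outside the pom, the same property
can also be set programmatically when building the context. A minimal sketch
(the 8m value, class name and app name are only illustrative; only
spark.buffer.pageSize itself comes from the discussion above):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PageSizeRepro {
      public static void main(String[] args) {
        // Same property the test pom passes to the forked JVMs; trying a
        // larger page (e.g. 8m) may show whether the overflow-page path
        // is still taken.
        SparkConf conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("PageSizeRepro")
            .set("spark.buffer.pageSize", "8m");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the failing aggregation/join here ...
        sc.stop();
      }
    }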


