Hi Spark users,

I am running a job that performs a join over a huge dataset (7 TB+), and the
executors keep crashing randomly, eventually causing the job to fail.
There are no out-of-memory exceptions in the logs, and the dmesg output
suggests the OS killed the JVM because of high memory usage. My suspicion
is that off-heap usage by the executors is causing this, since I limit each
executor's on-heap usage to 46 GB and each host running an executor has
60 GB of RAM. After an executor crashes, I can see the external shuffle
service (org.apache.spark.network.server.TransportRequestHandler) logging a
lot of ClosedChannelExceptions in the YARN node manager logs. This leads me
to believe that something is running out of memory during shuffle reads. Is
there a configuration to completely disable the use of off-heap memory? I
have tried setting spark.shuffle.io.preferDirectBufs=false, but the
executors are still being killed the same way.
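For reference, these are the off-heap-related settings I have been
experimenting around (the offHeap.enabled and memoryOverhead lines are
knobs I have only read about and am not sure are the right ones, so please
correct me; the overhead value is just a number I am trying, not something
I have validated):

spark.shuffle.io.preferDirectBufs false
spark.memory.offHeap.enabled false
spark.yarn.executor.memoryOverhead 6144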

Cluster details:
10 AWS c4.8xlarge hosts
RAM on each host - 60 GB
Number of cores on each host - 36
Additional hard disk on each host - 8 TB

Spark configuration:
dynamic allocation enabled
external shuffle service enabled
spark.driver.memory 1024M
spark.executor.memory 47127M
Spark master yarn-cluster
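
For completeness, the job is submitted with roughly the following command
(the class name and jar are placeholders):

spark-submit \
  --master yarn-cluster \
  --driver-memory 1024M \
  --executor-memory 47127M \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.JoinJob \
  join-job.jar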

Sample error in the YARN node manager logs:
2016-11-24 10:34:06,507 ERROR
org.apache.spark.network.server.TransportRequestHandler
(shuffle-server-50): Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123,
chunkIndex=0},
buffer=FileSegmentManagedBuffer{file=/mnt3/yarn/usercache/hadoop/appcache/application_1479898345621_0006/blockmgr-ad5301a9-e1e9-4723-a8c4-9276971b2259/2c/shuffle_3_963_0.data,
offset=0, length=669014456}} to /10.192.108.170:52782; closing connection
java.nio.channels.ClosedChannelException

Error in dmesg:
[799873.309897] Out of memory: Kill process 50001 (java) score 927 or
sacrifice child
[799873.314439] Killed process 50001 (java) total-vm:65652448kB,
anon-rss:57246528kB, file-rss:0kB
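
Rough arithmetic from the dmesg line above (assuming the JVM treats the
47127M executor memory as MiB):
  anon-rss: 57246528 kB / 1048576 ≈ 54.6 GiB resident
  configured heap: 47127 MiB / 1024 ≈ 46.0 GiB
  difference: ≈ 8.6 GiB outside the configured heap
So roughly 8-9 GB appears to be allocated off heap, which is what makes me
suspect direct buffers during shuffle.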

Thanks,
Aniket
