Hi Spark users,

I am running a job that joins a huge dataset (7 TB+), and the executors keep crashing at random, eventually causing the whole job to fail. There are no out-of-memory exceptions in the logs, but looking at the dmesg output, it appears the OS killed the JVM because of high memory usage. My suspicion is that the executor's off-heap usage is the cause: I am limiting the executor's on-heap usage to 46 GB, and each host running an executor has 60 GB of RAM. After an executor crashes, the external shuffle service (org.apache.spark.network.server.TransportRequestHandler) logs a lot of closed-channel exceptions in the YARN node manager logs, which leads me to believe that something runs out of memory during shuffle reads.

Is there a configuration to completely disable the use of off-heap memory? I have tried setting spark.shuffle.io.preferDirectBufs=false, but the executors are still being killed with the same error.
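To put numbers on it, here is the arithmetic as I understand it (the dmesg lines are quoted below): the heap is capped at spark.executor.memory = 47127M, i.e. roughly 46 GiB, yet the resident set size at kill time was anon-rss = 57246528 kB, roughly 54.6 GiB. That is about 8-9 GiB of native allocation beyond the heap (direct NIO buffers, Netty arenas, metaspace, thread stacks), which on a 60 GB host leaves almost nothing for the OS, the node manager, and the external shuffle service.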
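In case it helps, this is the change set I am planning to try next in spark-defaults.conf. It is only a sketch of my current understanding, not a confirmed fix (which is exactly what I am asking about); the property names should be valid for Spark on YARN, but the values are guesses for a 60 GB host:

    # smaller heap, leaving native headroom on the 60 GB host
    spark.executor.memory               40g
    # ask YARN to reserve explicit off-heap headroom (value in MB)
    spark.yarn.executor.memoryOverhead  8192
    # keep Tungsten execution memory on-heap (the default, stated explicitly)
    spark.memory.offHeap.enabled        false
    # already set; asks Netty to prefer heap buffers during shuffle transfers
    spark.shuffle.io.preferDirectBufs   false
    # hard cap on NIO direct buffers inside the executor JVM
    spark.executor.extraJavaOptions     -XX:MaxDirectMemorySize=4g

As far as I know, the JVM by default allows direct buffers to grow up to the heap size, so with a 46 GiB heap the direct-buffer pool alone could in principle claim another 46 GiB, and preferDirectBufs only expresses a preference rather than a hard limit. That is why I am wondering whether the -XX:MaxDirectMemorySize cap is the only way to bound it for certain.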
Cluster details:
  - 10 AWS c4.8xlarge hosts
  - RAM on each host: 60 GB
  - Number of cores on each host: 36
  - Additional hard disk on each host: 8 TB

Spark configuration:
  - dynamic allocation enabled
  - external shuffle service enabled
  - spark.driver.memory 1024M
  - spark.executor.memory 47127M
  - Spark master: yarn-cluster

Sample error in the YARN node manager log:

2016-11-24 10:34:06,507 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-50): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/mnt3/yarn/usercache/hadoop/appcache/application_1479898345621_0006/blockmgr-ad5301a9-e1e9-4723-a8c4-9276971b2259/2c/shuffle_3_963_0.data, offset=0, length=669014456}} to /10.192.108.170:52782; closing connection
java.nio.channels.ClosedChannelException

Error in dmesg:

[799873.309897] Out of memory: Kill process 50001 (java) score 927 or sacrifice child
[799873.314439] Killed process 50001 (java) total-vm:65652448kB, anon-rss:57246528kB, file-rss:0kB

Thanks,
Aniket