Thanks Rohit, Rodrick and Shreya. I tried changing spark.yarn.executor.memoryOverhead to 10 GB and lowering executor memory to 30 GB, and neither of these worked on its own. I finally had to reduce the number of cores per executor from 36 to 18, in addition to setting a higher spark.yarn.executor.memoryOverhead and a lower executor memory size. I had to trade off performance for reliability.
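For anyone tuning similarly, the sizing that finally worked can be sanity-checked with quick arithmetic (a back-of-the-envelope sketch; the figures are from this thread, but treating the remainder as OS/page-cache headroom is my own interpretation):

```python
# Rough per-host memory budget for the settings that finally worked on a
# 60 GB c4.8xlarge host. Figures come from this thread; the "headroom"
# interpretation is an assumption, not something stated in the thread.
host_ram_gb = 60
heap_gb = 30            # lowered executor memory (was 46)
overhead_gb = 10        # spark.yarn.executor.memoryOverhead = 10000 (MB)
cores = 18              # lowered from 36 to cut concurrent shuffle reads

container_gb = heap_gb + overhead_gb
headroom_gb = host_ram_gb - container_gb
print(f"container budget: {container_gb} GB, headroom: {headroom_gb} GB")
```

Halving the cores matters because each concurrent shuffle-read task can hold its own direct buffers, so fewer simultaneous tasks roughly caps peak off-heap usage.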
Unfortunately, Spark does a poor job of reporting off-heap memory usage. From the profiler, the job's heap usage is fairly static, but the off-heap memory fluctuates quite a lot. It looks like the bulk of the off-heap memory is held by io.netty.buffer.UnpooledUnsafeDirectByteBuf while the shuffle client is reading blocks from the shuffle service. It looks like org.apache.spark.network.util.TransportFrameDecoder retains them in its buffers field while decoding responses from the shuffle service. So far, it's not clear why it needs to hold multiple GBs in these buffers. Perhaps increasing the number of partitions may help with this.

Thanks,
Aniket

On Fri, Nov 25, 2016 at 1:09 AM Shreya Agarwal <shrey...@microsoft.com> wrote:

I don't think it's just memory overhead. It might be better to use an executor with less heap space (30 GB?). 46 GB would mean more data loaded into memory and more GC, which can cause issues. Also, have you tried to persist data in any way? If so, then that might be causing an issue. Lastly, I am not sure if your data has a skew that is forcing a lot of data onto one executor node.

Sent from my Windows 10 phone

From: Rodrick Brown <rodr...@orchardplatform.com>
Sent: Friday, November 25, 2016 12:25 AM
To: Aniket Bhatnagar <aniket.bhatna...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: OS killing Executor due to high (possibly off heap) memory usage

Try setting spark.yarn.executor.memoryOverhead 10000

On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:

Hi Spark users

I am running a job that does a join of a huge dataset (7 TB+), and the executors keep crashing randomly, eventually causing the job to crash. There are no out-of-memory exceptions in the logs, and looking at the dmesg output, it seems the OS killed the JVM because of high memory usage.
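[On the "increasing the number of partitions" suggestion above: more partitions mean smaller shuffle blocks, so TransportFrameDecoder has less data to buffer per fetch. A rough sizing sketch; the ~128 MB per-partition target is a common rule of thumb, not a figure from this thread:]

```python
# Rough partition count for shuffling a 7 TB dataset, assuming a
# ~128 MB per-partition target (a common rule of thumb, not a figure
# from this thread). More partitions => smaller shuffle blocks =>
# smaller frames buffered at once on the shuffle-read path.
total_mb = 7 * 1024 * 1024          # 7 TB expressed in MB
target_partition_mb = 128
partitions = total_mb // target_partition_mb
print(partitions)  # 57344
```

In practice this would be applied via spark.sql.shuffle.partitions or an explicit repartition() before the join.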
My suspicion is that off-heap usage by the executor is causing this, as I am limiting the executor's on-heap usage to 46 GB and each host running the executor has 60 GB of RAM. After an executor crashes, I can see that the external shuffle service (org.apache.spark.network.server.TransportRequestHandler) logs a lot of closed-channel exceptions in the YARN node manager logs. This leads me to believe that something triggers an out-of-memory condition during shuffle reads. Is there a configuration to completely disable usage of off-heap memory? I have tried setting spark.shuffle.io.preferDirectBufs=false, but the executor is still getting killed with the same error.

Cluster details:
- 10 AWS c4.8xlarge hosts
- RAM on each host: 60 GB
- Number of cores on each host: 36
- Additional hard disk on each host: 8 TB

Spark configuration:
- dynamic allocation enabled
- external shuffle service enabled
- spark.driver.memory 1024M
- spark.executor.memory 47127M
- Spark master: yarn-cluster

Sample error in YARN node manager logs:

2016-11-24 10:34:06,507 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-50): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/mnt3/yarn/usercache/hadoop/appcache/application_1479898345621_0006/blockmgr-ad5301a9-e1e9-4723-a8c4-9276971b2259/2c/shuffle_3_963_0.data, offset=0, length=669014456}} to /10.192.108.170:52782; closing connection
java.nio.channels.ClosedChannelException

Error in dmesg:

[799873.309897] Out of memory: Kill process 50001 (java) score 927 or sacrifice child
[799873.314439] Killed process 50001 (java) total-vm:65652448kB, anon-rss:57246528kB, file-rss:0kB

Thanks,
Aniket

--
Rodrick Brown / DevOPs
9174456839 / rodr...@orchardplatform.com
Orchard Platform
101 5th Avenue, 4th Floor, New York, NY
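[For readers decoding the dmesg line in the original report: anon-rss is reported in kB, so it can be compared directly against the configured heap. A quick sketch; labeling the difference as off-heap plus JVM overhead is my interpretation:]

```python
# Decode the dmesg figures from the original report. anon-rss is the
# non-file-backed resident memory, in kB, at the time of the OOM kill.
anon_rss_kb = 57246528
heap_mb = 47127                     # spark.executor.memory from the thread

rss_gb = anon_rss_kb / (1024 * 1024)
heap_gb = heap_mb / 1024
print(f"RSS at kill: {rss_gb:.1f} GB; configured heap: {heap_gb:.1f} GB")
# The gap above the heap (~8-9 GB here) is everything off-heap: netty
# direct buffers, JVM metaspace, thread stacks, code cache, etc.
```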