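Try setting spark.yarn.executor.memoryOverhead to 10000. The value is in
MB, so this reserves roughly 10 GB per executor container for off-heap
memory on top of the JVM heap.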
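
To see why the overhead matters here, a rough back-of-the-envelope check
may help. The sketch below uses only the figures from the quoted message
(executor memory, host RAM, the anon-rss from dmesg). The default-overhead
rule it encodes (10% of executor memory, with a 384 MB minimum) is my
understanding of Spark on YARN for the releases current at the time, and
the function names are made up for illustration.

```python
# All sizes in MB. Input figures are taken from the quoted message below.
EXECUTOR_MEMORY_MB = 47127            # spark.executor.memory
HOST_RAM_MB = 60 * 1024               # 60 GB of RAM per host
KILLED_RSS_MB = 57246528 // 1024      # anon-rss from dmesg, kB -> MB

def default_overhead_mb(executor_memory_mb):
    # Assumed default rule: 10% of executor memory, at least 384 MB.
    return max(384, int(executor_memory_mb * 0.10))

def container_mb(executor_memory_mb, overhead_mb):
    # Memory requested from YARN per executor, before YARN rounds it up
    # to its own allocation increment.
    return executor_memory_mb + overhead_mb

default_request = container_mb(EXECUTOR_MEMORY_MB,
                               default_overhead_mb(EXECUTOR_MEMORY_MB))
suggested_request = container_mb(EXECUTOR_MEMORY_MB, 10000)

print("default container request  :", default_request, "MB")
print("container with 10000 MB    :", suggested_request, "MB")
print("resident memory when killed:", KILLED_RSS_MB, "MB")
print("host RAM left over         :", HOST_RAM_MB - suggested_request, "MB")
```

With these numbers the process was using several GB more than the heap
limit when the kernel killed it, so the default overhead looks too small
for this job. Note that 47127 MB of heap plus 10000 MB of overhead leaves
only about 4 GB of the host for the OS and the NodeManager, and the
request also has to fit within what YARN is allowed to hand out on each
node, so it may be necessary to lower spark.executor.memory at the same
time.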

On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Hi Spark users
>
> I am running a job that joins a huge dataset (7 TB+), and the executors
> keep crashing randomly, eventually causing the job to fail. There are no
> out-of-memory exceptions in the logs, and the dmesg output suggests that
> the OS killed the JVM because of high memory usage. I suspect off-heap
> usage by the executor is the cause, since I am limiting the executor's
> on-heap usage to 46 GB and each host running an executor has 60 GB of
> RAM. After an executor crashes, the external shuffle service
> (org.apache.spark.network.server.TransportRequestHandler) logs many
> channel-closed exceptions in the YARN NodeManager logs. This leads me to
> believe that something triggers an out-of-memory condition during shuffle
> read. Is there a configuration to completely disable off-heap memory
> usage? I have tried setting spark.shuffle.io.preferDirectBufs=false, but
> the executor is still killed with the same error.
>
> Cluster details:
> 10 AWS c4.8xlarge hosts
> RAM on each host - 60 GB
> Number of cores on each host - 36
> Additional hard disk on each host - 8 TB
>
> Spark configuration:
> dynamic allocation enabled
> external shuffle service enabled
> spark.driver.memory 1024M
> spark.executor.memory 47127M
> Spark master yarn-cluster
>
> Sample error in yarn node manager:
> 2016-11-24 10:34:06,507 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-50): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/mnt3/yarn/usercache/hadoop/appcache/application_1479898345621_0006/blockmgr-ad5301a9-e1e9-4723-a8c4-9276971b2259/2c/shuffle_3_963_0.data, offset=0, length=669014456}} to /10.192.108.170:52782; closing connection
> java.nio.channels.ClosedChannelException
>
> Error in dmesg:
> [799873.309897] Out of memory: Kill process 50001 (java) score 927 or sacrifice child
> [799873.314439] Killed process 50001 (java) total-vm:65652448kB, anon-rss:57246528kB, file-rss:0kB
>
> Thanks,
> Aniket
>



-- 

*Rodrick Brown */ *DevOPs*

9174456839 / rodr...@orchardplatform.com

Orchard Platform
101 5th Avenue, 4th Floor, New York, NY

