Hi Jon,

The fix for this is to increase spark.yarn.executor.memoryOverhead to something
greater than its default of 384 (MB).

This widens the gap between the executor's heap size and the amount of memory it
requests from YARN. The extra headroom is needed because JVMs use some memory
beyond their heap (thread stacks, class metadata, off-heap buffers, and so on).
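
For example, here is a minimal sketch of setting it when building the SparkConf
(the app name, the 6g heap, and the 1024 MB overhead are illustrative
placeholders; tune the overhead for your workload). The same property can also
be passed to spark-submit with --conf or put in spark-defaults.conf.

    import org.apache.spark.{SparkConf, SparkContext}

    // Ask YARN for 6g of executor heap plus 1024 MB of off-heap headroom.
    // 1024 is only an example starting point, not a recommended value.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "6g")
      .set("spark.yarn.executor.memoryOverhead", "1024")
    val sc = new SparkContext(conf)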

-Sandy

> On Dec 19, 2014, at 9:04 AM, Jon Chase <jon.ch...@gmail.com> wrote:
> 
> I'm getting the same error ("ExecutorLostFailure") - my input RDD is 100k small 
> files (~2MB each).  I do a simple map, then keyBy(), and then 
> rdd.saveAsHadoopDataset(...).  Depending on the memory settings given to 
> spark-submit, the time before the first ExecutorLostFailure varies (more 
> memory == longer until failure) - but it usually happens after about 100 
> files have been processed.  
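> 
> In rough outline, the job looks like the sketch below (the paths, the parsing, 
> and the key choice are placeholders rather than the actual code; sc is the 
> SparkContext created for the job):
> 
>     import org.apache.spark.SparkContext._                  // pair-RDD functions in Spark 1.1
>     import org.apache.hadoop.fs.Path
>     import org.apache.hadoop.io.Text
>     import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
> 
>     val keyed = sc.textFile("s3://bucket/input/*")          // ~100k small (~2MB) files
>       .map(_.trim)                                          // the simple map step
>       .keyBy(_.take(8))                                     // keyBy() on some prefix of each record
> 
>     val jobConf = new JobConf(sc.hadoopConfiguration)
>     jobConf.setOutputKeyClass(classOf[Text])                // nominal output types for the JobConf
>     jobConf.setOutputValueClass(classOf[Text])
>     jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
>     FileOutputFormat.setOutputPath(jobConf, new Path("s3://bucket/output"))
>     keyed.saveAsHadoopDataset(jobConf)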
> 
> I'm running Spark 1.1.0 on AWS EMR with YARN.    It appears that YARN is killing 
> the executor because it thinks the executor is exceeding its memory limit.  However, 
> I can't reproduce any OOM issues when running locally, no matter the size of the data set. 
> 
> According to the YARN logs, YARN sees the container's memory usage increasing 
> steadily:
> 
> 2014-12-18 22:06:43,505 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.1 GB of 6.5 GB physical memory 
> used; 13.8 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:46,516 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory 
> used; 13.9 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:49,524 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory 
> used; 14.0 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:52,531 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.4 GB of 6.5 GB physical memory 
> used; 14.1 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:55,538 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory 
> used; 14.2 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:58,549 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Memory usage of ProcessTree 24273 for container-id 
> container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory 
> used; 14.3 GB of 32.5 GB virtual memory used
> 2014-12-18 22:06:58,549 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Process tree for container: 
> container_1418928607193_0011_01_000002 has processes older than 1 iteration 
> running over the configured limit. Limit=6979321856, current usage = 
> 6995812352
> 2014-12-18 22:06:58,549 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  (Container Monitor): Container 
> [pid=24273,containerID=container_1418928607193_0011_01_000002] is running 
> beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical 
> memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1418928607193_0011_01_000002 :
>       |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>       |- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c 
> /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m 
> -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
>  org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
>  1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1> 
> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stdout
>  2> 
> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stderr
>  
>       |- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660 
> /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m 
> -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
>  org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
>  1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 
> 
> 
> I've analyzed some heap dumps and see nothing out of the ordinary.   Would 
> love to know what could be causing this.
> 
> 
>> On Fri, Dec 19, 2014 at 7:46 AM, bethesda <swearinge...@mac.com> wrote:
>> I have a job that runs fine on relatively small input datasets but then
>> reaches a threshold where I begin to consistently get "Fetch failure" as
>> the Failure Reason, late in the job, during a saveAsTextFile() operation.
>> 
>> The first error we are seeing on the "Details for Stage" page is
>> "ExecutorLostFailure"
>> 
>> My Shuffle Read is 3.3 GB, and that's the only thing that seems high.  We have
>> three servers, each configured with 5g of memory for this job, and the job is
>> running in spark-shell.  The first error in the shell is "Lost executor 2
>> on (servername): remote Akka client disassociated."
>> 
>> We are still trying to understand how best to diagnose jobs using the web UI,
>> so it's likely there's some helpful info here that we just don't know how to
>> interpret -- is there any kind of "troubleshooting guide" beyond the Spark
>> Configuration page?  I don't know if I'm providing enough info here.
>> 
>> thanks.
>> 
> 
