I noticed that some executors have issues with scratch space. I see the following in the YARN app container stderr around the time when YARN killed the executor for using too much memory.
-- App container stderr --

15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache rdd_6_346 in memory! (computed 3.0 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks) + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_346 to disk instead.
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache rdd_6_49 in memory! (computed 3.1 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks) + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_49 to disk instead.

-- Yarn Nodemanager log --

2015-09-21 21:44:05,716 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=5114,containerID=container_1442869100946_0001_01_000056] is running beyond physical memory limits. Current usage: 52.2 GB of 52 GB physical memory used; 53.0 GB of 260 GB virtual memory used. Killing container.
Dump of the process-tree for container_1442869100946_0001_01_000056 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 5117 5114 5114 5114 (java) 1322810 27563 56916316160 13682772 /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms47924m -Xmx47924m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp -Dspark.akka.failure-detector.threshold=3000.0 -Dspark.akka.heartbeat.interval=10000s -Dspark.akka.threads=4 -Dspark.history.ui.port=18080 -Dspark.akka.heartbeat.pauses=60000s -Dspark.akka.timeout=1000s -Dspark.akka.frameSize=50 -Dspark.driver.port=52690 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler --executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id application_1442869100946_0001 --user-class-path file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar
|- 5114 5112 5114 5114 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms47924m -Xmx47924m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp '-Dspark.akka.failure-detector.threshold=3000.0' '-Dspark.akka.heartbeat.interval=10000s' '-Dspark.akka.threads=4' '-Dspark.history.ui.port=18080' '-Dspark.akka.heartbeat.pauses=60000s' '-Dspark.akka.timeout=1000s' '-Dspark.akka.frameSize=50' '-Dspark.driver.port=52690' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler --executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id application_1442869100946_0001 --user-class-path file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar 1> /var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stdout 2> /var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stderr

Is it possible to find out what the scratch space is used for? Which Spark setting should I try to adjust to solve the issue? (A config sketch with the candidate settings is at the bottom of this message, after the quoted thread.)

On Thu, Sep 10, 2015 at 2:52 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> YARN will never kill processes for being unresponsive.
>
> It may kill processes for occupying more memory than it allows. To get
> around this, you can either bump spark.yarn.executor.memoryOverhead or turn
> off the memory checks entirely with yarn.nodemanager.pmem-check-enabled.
>
> -Sandy
>
> On Tue, Sep 8, 2015 at 10:48 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>
>> The problem we have now is skewed data (2360 tasks done in 5 min, 3
>> tasks in 40 min, and 1 task in 2 hours).
>>
>> Some people on the team worry that the executor which runs the longest
>> task can be killed by YARN (because the executor might be unresponsive due
>> to GC, or it might occupy more memory than YARN allows).
>>
>> On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>>> Those settings seem reasonable to me.
>>>
>>> Are you observing performance that's worse than you would expect?
>>>
>>> -Sandy
>>>
>>> On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>>
>>>> Hi Sandy
>>>>
>>>> Thank you for your reply.
>>>> Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
>>>> with the EMR setting for Spark "maximizeResourceAllocation": "true".
>>>>
>>>> It is automatically converted to the Spark settings
>>>> spark.executor.memory 47924M
>>>> spark.yarn.executor.memoryOverhead 5324
>>>>
>>>> We also set spark.default.parallelism = slave_count * 16.
>>>>
>>>> Does that look good to you? (We run a single heavy job on the cluster.)
>>>>
>>>> Alex
>>>>
>>>> On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> If they're both configured correctly, there's no reason that Spark
>>>>> Standalone should provide performance or memory improvements over Spark on
>>>>> YARN.
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone
>>>>>>
>>>>>> We are trying the latest AWS emr-4.0.0 with Spark, and my question is
>>>>>> about YARN vs Standalone mode.
>>>>>> Our use case is:
>>>>>> - start a 100-150 node cluster every week
>>>>>> - run one heavy Spark job (5-6 hours)
>>>>>> - save data to S3
>>>>>> - stop the cluster
>>>>>>
>>>>>> Officially, AWS emr-4.0.0 comes with Spark on YARN.
>>>>>> It's probably possible to hack EMR by creating a bootstrap script which
>>>>>> stops YARN and starts a master and slaves on each computer (to run Spark
>>>>>> in standalone mode).
>>>>>>
>>>>>> My questions are:
>>>>>> - Does Spark standalone provide a significant performance / memory
>>>>>> improvement in comparison to YARN mode?
>>>>>> - Is it worth hacking the official EMR Spark on YARN setup to switch
>>>>>> Spark to standalone mode?
>>>>>>
>>>>>> I already created a comparison table and want you to check whether my
>>>>>> understanding is correct.
>>>>>>
>>>>>> Let's say an r3.2xlarge computer has 52 GB of RAM available for Spark
>>>>>> executor JVMs.
>>>>>>
>>>>>> standalone (STDLN) vs YARN comparison
>>>>>>
>>>>>> can executor allocate up to 52 GB ram?
>>>>>>     STDLN: yes | YARN: yes
>>>>>> will executor be unresponsive after using all 52 GB ram because of GC?
>>>>>>     STDLN: yes | YARN: yes
>>>>>> additional JVMs on the slave besides the Spark executor?
>>>>>>     STDLN: worker | YARN: node manager
>>>>>> are the additional JVMs lightweight?
>>>>>>     STDLN: yes | YARN: yes
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Alex
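For reference, here is a minimal sketch of the knobs discussed in this thread (Scala, Spark 1.x on YARN in yarn-client mode assumed; the names "heavy-weekly-job" and the numeric values are illustrative placeholders, not tested recommendations). These settings need to be in place before the application launches, e.g. in SparkConf before the SparkContext is created, via --conf on spark-submit, or in EMR's spark defaults:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch, assuming Spark 1.5 on YARN in yarn-client mode.
    // All numbers are placeholders, not tuned recommendations.
    val conf = new SparkConf()
      .setAppName("heavy-weekly-job")
      // The current heap (-Xmx47924m) plus overhead (5324 MB) adds up to the
      // 52 GB container exactly, and the process crept just over it (52.2 of
      // 52 GB). Shrinking the heap and/or growing the overhead leaves YARN
      // some slack.
      .set("spark.executor.memory", "44g")
      .set("spark.yarn.executor.memoryOverhead", "8192") // in MB
      // Fewer concurrent tasks per executor also reduces the unroll ("scratch
      // space") memory shared across task threads while caching partitions.
      .set("spark.executor.cores", "6")

    val sc = new SparkContext(conf)

The other route Sandy mentions, setting yarn.nodemanager.pmem-check-enabled to false in yarn-site.xml, stops the NodeManager from killing containers over physical memory, but it only hides the overcommit rather than removing it.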