I noticed that some executors have an issue with scratch space.
I see the following in the YARN app container stderr around the time when YARN
killed the executor for using too much memory.

-- App container stderr --
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
rdd_6_346 in memory! (computed 3.0 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
+ 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_346
to disk instead.
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
rdd_6_49 in memory! (computed 3.1 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
+ 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_49 to
disk instead.
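
If I'm reading the MemoryStore accounting right (rough math on my side,
assuming the Spark 1.x defaults spark.storage.memoryFraction=0.6 and
spark.storage.safetyFraction=0.9), the 25.2 GB storage limit comes from the
executor heap:

    47924 MB heap * 0.6 * 0.9 ~= 25879 MB ~= 25.3 GB

so the 24.4 GB of scratch space alone nearly fills the storage pool, which is
why caching keeps falling back to disk.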

-- Yarn Nodemanager log --
2015-09-21 21:44:05,716 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Container
[pid=5114,containerID=container_1442869100946_0001_01_000056] is running
beyond physical memory limits. Current usage: 52.2 GB of
52 GB physical memory used; 53.0 GB of 260 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1442869100946_0001_01_000056 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 5117 5114 5114 5114 (java) 1322810 27563 56916316160 13682772
/usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError=kill %p
-Xms47924m -Xmx47924m -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled
-Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp
-Dspark.akka.failure-detector.threshold=3000.0
-Dspark.akka.heartbeat.interval=10000s -Dspark.akka.threads=4
-Dspark.history.ui.port=18080 -Dspark.akka.heartbeat.pauses=60000s
-Dspark.akka.timeout=1000s -Dspark.akka.frameSize=50
-Dspark.driver.port=52690
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
--executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
application_1442869100946_0001 --user-class-path
file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar
        |- 5114 5112 5114 5114 (bash) 0 0 9658368 291 /bin/bash -c
/usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError='kill %p'
-Xms47924m -Xmx47924m '-verbose:gc' '-XX:+PrintGCDetails'
'-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC'
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70'
'-XX:+CMSClassUnloadingEnabled'
-Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp
'-Dspark.akka.failure-detector.threshold=3000.0'
'-Dspark.akka.heartbeat.interval=10000s' '-Dspark.akka.threads=4'
'-Dspark.history.ui.port=18080' '-Dspark.akka.heartbeat.pauses=60000s'
'-Dspark.akka.timeout=1000s' '-Dspark.akka.frameSize=50'
'-Dspark.driver.port=52690'
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
--executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
application_1442869100946_0001 --user-class-path
file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar
1>
/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stdout
2>
/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stderr
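
If my arithmetic is right (assuming 4 KB pages for the RSS figure in the
process tree), the numbers line up:

    container limit: 47924 MB heap (-Xmx) + 5324 MB memoryOverhead = 53248 MB = 52 GB
    actual usage:    13682772 RSS pages * 4 KB ~= 52.2 GB

so the executor JVM went just over the 52 GB physical limit and the
NodeManager killed the container.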



Is it possible to find out what the scratch space is used for?

Which Spark setting should I try adjusting to solve the issue?
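
The kind of change I have in mind is along the lines of Sandy's suggestion
quoted below, e.g. in spark-defaults.conf (numbers are only illustrative,
shrinking the heap to leave a bigger overhead inside the same 52 GB
container):

    spark.executor.memory              44g
    spark.yarn.executor.memoryOverhead 8192

or, more bluntly, disabling the physical memory check in yarn-site.xml:

    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>false</value>
    </property>

but I'd like to understand what the scratch space is actually used for before
turning these knobs.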

On Thu, Sep 10, 2015 at 2:52 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> YARN will never kill processes for being unresponsive.
>
> It may kill processes for occupying more memory than it allows.  To get
> around this, you can either bump spark.yarn.executor.memoryOverhead or turn
> off the memory checks entirely with yarn.nodemanager.pmem-check-enabled.
>
> -Sandy
>
> On Tue, Sep 8, 2015 at 10:48 PM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> The problem we have now is data skew (2360 tasks finish in 5 min, 3
>> tasks in 40 min, and 1 task takes 2 hours).
>>
>> Some people on the team worry that the executor running the longest
>> task could be killed by YARN (either because the executor becomes
>> unresponsive due to GC, or because it occupies more memory than YARN
>> allows).
>>
>>
>>
>> On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> Those settings seem reasonable to me.
>>>
>>> Are you observing performance that's worse than you would expect?
>>>
>>> -Sandy
>>>
>>> On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
>>>> Hi Sandy
>>>>
>>>> Thank you for your reply.
>>>> Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
>>>> with the EMR Spark setting "maximizeResourceAllocation": "true".
>>>>
>>>> It is automatically converted to the Spark settings
>>>> spark.executor.memory              47924M
>>>> spark.yarn.executor.memoryOverhead 5324
>>>>
>>>> We also set spark.default.parallelism = slave_count * 16.
>>>>
>>>> Does that look good to you? (We run a single heavy job on the cluster.)
>>>>
>>>> Alex
>>>>
>>>> On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> If they're both configured correctly, there's no reason that Spark
>>>>> Standalone should provide a performance or memory improvement over Spark
>>>>> on YARN.
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <
>>>>> apivova...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone
>>>>>>
>>>>>> We are trying the latest AWS emr-4.0.0 with Spark, and my question is
>>>>>> about YARN vs standalone mode.
>>>>>> Our use case is:
>>>>>> - start a 100-150 node cluster every week,
>>>>>> - run one heavy Spark job (5-6 hours),
>>>>>> - save the data to S3,
>>>>>> - stop the cluster.
>>>>>>
>>>>>> Officially, AWS emr-4.0.0 comes with Spark on YARN.
>>>>>> It's probably possible to hack EMR by creating a bootstrap script that
>>>>>> stops YARN and starts a master and slaves on each machine (to run Spark
>>>>>> in standalone mode).
>>>>>>
>>>>>> My questions are:
>>>>>> - Does Spark standalone provide a significant performance / memory
>>>>>> improvement compared to YARN mode?
>>>>>> - Is it worth hacking the official EMR Spark-on-YARN setup to switch
>>>>>> Spark to standalone mode?
>>>>>>
>>>>>>
>>>>>> I have already created a comparison table and would like you to check
>>>>>> whether my understanding is correct.
>>>>>>
>>>>>> Let's say an r3.2xlarge machine has 52 GB of RAM available for Spark
>>>>>> executor JVMs.
>>>>>>
>>>>>>                     standalone vs YARN comparison
>>>>>>
>>>>>>                                                                      STDLN  | YARN
>>>>>> can executor allocate up to 52 GB RAM                              - yes    | yes
>>>>>> will executor be unresponsive after using all 52 GB RAM due to GC  - yes    | yes
>>>>>> additional JVMs on the slave besides the Spark executor            - worker | node manager
>>>>>> are additional JVMs lightweight                                     - yes    | yes
>>>>>>
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
