Hi, all.
Just wanted to provide an update: I’m finally getting good YARN cluster
utilization (consistently within the 90-100% range!). I believe the biggest
change was increasing the minimum split size. Since our input is all in S3 and
data locality isn’t really an issue, I bumped it up to 2G. Each container now
works through a much larger split before exiting, so container
allocation/deallocation happens far less frequently and its overhead matters
much less.
<property><name>mapreduce.input.fileinputformat.split.minsize</name><value>2147483648</value></property>
<!-- Minimum split size: 2G -->
I’m not sure how much impact the following changes had, since they were made at
the same time, but everything’s humming along now, so I’m going to leave them
in place.
I also reduced the node heartbeat interval from 1000ms down to 500ms
("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms": "500" in the
cluster configuration JSON), since I’m told the scheduler will only assign one
container per node per heartbeat when dealing with non-localized data, as ours
is since it’s in S3. I also doubled the memory given to the YARN
ResourceManager from the default for the m3.xlarge node type I’m using
("YARN_RESOURCEMANAGER_HEAPSIZE": "5120" in the cluster configuration JSON).
Thanks again to Sunil and Shubh (and my colleague, York) for the helpful
guidance!
Take care,
-Jeff
From: Shubh hadoopExp [mailto:[email protected]]
Sent: Wednesday, May 25, 2016 11:08 PM
To: Guttadauro, Jeff <[email protected]>
Cc: Sunil Govind <[email protected]>; [email protected]
Subject: Re: YARN cluster underutilization
Hey,
OFF_SWITCH allocation refers to whether data locality is maintained or not. It
has no relation to the heartbeat! The heartbeat is just what drives the
processing of the pipeline of pending container requests.
-Shubh
On May 25, 2016, at 3:30 PM, Guttadauro, Jeff
<[email protected]<mailto:[email protected]>> wrote:
Interesting stuff! I did not know about this handling of OFF_SWITCH requests.
To get around it, would you recommend reducing the heartbeat interval, perhaps
to 250ms for a 4x improvement in container allocation rate (or is it not quite
that simple)? Maybe doing this in combination with a greater number of smaller
nodes would help? Would overloading the ResourceManager be a concern if doing
that? Should I bump up the “YARN_RESOURCEMANAGER_HEAPSIZE” configuration
property (the current default for m3.xlarge is 2396M), or would you suggest any
other knobs to turn to help the RM handle it?
Thanks again for all your help, Sunil!
From: Sunil Govind [mailto:[email protected]]
Sent: Wednesday, May 25, 2016 1:07 PM
To: Guttadauro, Jeff <[email protected]>; [email protected]
Subject: Re: YARN cluster underutilization
Hi Jeff,
I do see the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms property
set to 1000 in the job configuration
>> OK, that makes sense; the node heartbeat appears to be at its default.
If no locality is specified in the resource requests (i.e.,
ResourceRequest.ANY), then YARN will allocate only one container per node
heartbeat. So your container allocation rate is slow considering 600k requests
and only 20 nodes. And if containers are also being released quickly (I can see
that some containers’ lifetimes are only 80 to 90 seconds), the problem
compounds: a node that frees several slots at once can still refill them only
one heartbeat at a time, so the allocation rate cannot keep up with the release
rate.
YARN-4963<https://issues.apache.org/jira/browse/YARN-4963> is trying to allow
more than one allocation per heartbeat for OFF_SWITCH (ANY) requests, but it is
not yet available in any release. I suggest investigating along these lines to
confirm these points.
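Looking at the patch on that JIRA, it adds a knob to capacity-scheduler.xml
along these lines (property name taken from the patch, so treat it as
indicative; it is not available in 2.7.1):
<property><name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name><value>4</value></property>
<!-- Allow up to 4 OFF_SWITCH (ANY) assignments per node heartbeat; default is 1 -->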
Thanks
Sunil
On Wed, May 25, 2016 at 11:00 PM Guttadauro, Jeff
<[email protected]<mailto:[email protected]>> wrote:
Thanks for digging into the log, Sunil, and making some interesting
observations!
The heartbeat interval hasn’t been changed from its default, and I do see the
yarn.resourcemanager.nodemanagers.heartbeat-interval-ms property set to 1000 in
the job configuration. I was searching in the log for heartbeat interval
information, but I didn’t find anything. Where do you look in the log for the
heartbeats?
Also, you are correct about there being no data locality, as all the input data
is in S3. The utilization has been fluctuating, but I can’t really see a
pattern or tell why. It actually started out pretty low in the 20-30% range
and then managed to get up into the 50-70% range after a while, but that was
short-lived, as it went back down into the 20-30% range for quite a while.
While writing this, I was surprised to see it hit 80%!! That’s the first time
I’ve seen it that high in the 20 hours it’s been running… although it looks
like it may be headed back down. I’m perplexed. Wouldn’t you generally expect
fairly stable utilization over the course of the job? (This is the only job
running.)
Thanks,
-Jeff
From: Sunil Govind
[mailto:[email protected]<mailto:[email protected]>]
Sent: Wednesday, May 25, 2016 11:55 AM
To: Guttadauro, Jeff <[email protected]>; [email protected]
Subject: Re: YARN cluster underutilization
Hi Jeff,
Thanks for sharing this information. I have some observations from these logs.
- I think the node heartbeat is around 2-3 seconds here. Was it changed for
some other reason?
- And all of the mappers’ resource requests seem to be asking for type ANY
(there is no data locality). Please correct me if I am wrong.
If the resource request type is ANY, only one container will be allocated per
node heartbeat, and here the node heartbeat delay is also longer than the
default. I can see that containers are being released very quickly too. So when
you started your application, were you seeing better resource utilization, and
did the underutilization begin once containers started getting
released/completed?
Please look into this; it may be the reason.
Thanks
Sunil
On Wed, May 25, 2016 at 9:59 PM Guttadauro, Jeff
<[email protected]<mailto:[email protected]>> wrote:
Thanks for your thoughts thus far, Sunil. I’m most grateful for any additional
help you or others can offer. To answer your questions:
1. This is a custom M/R job, which uses mappers only (no reduce phase) to
process GPS probe data and filter it based on inclusion within a provided
polygon. There is actually a lot of upfront work done in the driver to make
that task as simple as possible (it identifies the tiles that are completely
inside the polygon and those that fall across an edge, which need more
processing), but the job is still more compute-intensive than wordcount, for
example.
2. I’m running almost 84k mappers for this job. This is actually down from
~600k mappers, since another thing I’ve done is increase
mapreduce.input.fileinputformat.split.minsize to 536870912 (512M) for the job
(see the snippet after this list). Data is in S3, so loss of locality isn’t
really a concern.
3. For the NodeManager configuration, I’m using EMR’s defaults for the
m3.xlarge instance type, which are yarn.scheduler.minimum-allocation-mb=32,
yarn.scheduler.maximum-allocation-mb=11520, and
yarn.nodemanager.resource.memory-mb=11520. The YARN dashboard shows min/max
allocations of <memory:32, vCores:1>/<memory:11520, vCores:8>.
4. Capacity Scheduler [MEMORY]
5. I’ve attached 2500 lines from the RM log. Happy to grab more, but the logs
are pretty big, and I thought that might be sufficient.
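For reference, the split-size override mentioned in #2, in the same form as the
other job properties:
<property><name>mapreduce.input.fileinputformat.split.minsize</name><value>536870912</value></property>
<!-- Minimum split size: 512M -->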
Any guidance is much appreciated!
-Jeff
From: Sunil Govind
[mailto:[email protected]<mailto:[email protected]>]
Sent: Wednesday, May 25, 2016 10:55 AM
To: Guttadauro, Jeff <[email protected]>; [email protected]
Subject: Re: YARN cluster underutilization
Hi Jeff,
It looks like you are allocating a lot of memory to the AM container. You most
likely don’t need 6GB (as per the log). Could you please provide some more
information?
1. What type of mapreduce application (wordcount, etc.) are you running? Some
AMs may be CPU-intensive and some may not be, so memory/cpu can be tuned for
better utilization based on the type of application.
2. How many mappers (and reducers) are you trying to run here?
3. You have mentioned that each node has 8 cores and 15GB, but how much is
actually configured for the NM?
4. Which scheduler are you using?
5. It would be better to attach the RM log if possible.
Thanks
Sunil
On Wed, May 25, 2016 at 8:58 PM Guttadauro, Jeff
<[email protected]<mailto:[email protected]>> wrote:
Hi, all.
I have an M/R (map-only) job that I’m running on a Hadoop 2.7.1 YARN cluster
that is quite underutilized (utilization of around 25-30%). The EMR cluster is
1 master + 20 core m3.xlarge nodes, which have 8 cores each and 15G total
memory (with 11.25G of that available to YARN). I’ve configured mapper memory
with the following properties, which should allow for 8 containers running map
tasks per node (11520M available to YARN ÷ 1440M per container = 8):
<property><name>mapreduce.map.memory.mb</name><value>1440</value></property>
<!-- Container size -->
<property><name>mapreduce.map.java.opts</name><value>-Xmx1024m</value></property>
<!-- JVM arguments for a Map task -->
It was suggested that perhaps my AppMaster was having trouble keeping up with
creating all the mapper containers and that I should bulk up its resource
allocation. So I did, as shown below, giving it a 6.25G container (5G heap), 3
vcores, and 60 task listener threads.
<property><name>yarn.app.mapreduce.am.job.task.listener.thread-count</name><value>60</value></property>
<!-- App Master task listener threads -->
<property><name>yarn.app.mapreduce.am.resource.cpu-vcores</name><value>3</value></property>
<!-- App Master container vcores -->
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>6400</value></property>
<!-- App Master container size -->
<property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx5120m</value></property>
<!-- JVM arguments for each Application Master -->
Taking a look at the node on which the AppMaster is running, I’m seeing plenty
of idle CPU and free memory, yet there are still nodes with no utilization (0
running containers). The log indicates that the AppMaster has far more memory
(physical/virtual) than it appears to need, with repeated messages like this:
2016-05-25 13:59:04,615 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 11265 for container-id
container_1464122327865_0002_01_000001: 1.6 GB of 6.3 GB physical memory used;
6.1 GB of 31.3 GB virtual memory used
Can you please help me figure out where to go from here with troubleshooting,
or suggest other things to try?
Thanks!
-Jeff