[
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543226#comment-16543226
]
Rohith Sharma K S commented on YARN-8234:
-----------------------------------------
Thanks [~ziqian hu] for the patch and your interest. I apologize for my
delayed response. The approach seems reasonable to me.
Some comments:
# YarnConfiguration.java :
## Change the configuration names to use the
yarn.resourcemanager.system-metrics-publisher.* prefix.
## The math in the yarn.resourcemanager.timeline-server-v1.buffer-size
description seems to be incorrect. It should be batch-size * pool-size + 1.
# TimelineServiceV1Publisher.java
## entityQueue is created with the buffer size as its capacity, but the
putEntity method invokes entityQueue.offer. There are two problems:
*** The number of events sent in one batch will exceed the configured batch
size. Because the queue capacity is greater than batch-size, entities keep
being added to the queue. Draining sends whatever entityQueue holds at that
moment, which can exceed batch-size.
*** entityQueue.offer returns false if no space is available, which means the
corresponding entity is lost. We should not lose entities; it is better to
call entityQueue.put, which waits for space to become available.
## Configuring buffer-size is going to lose entities, as per my previous
comment. I think buffer-size is not needed. Instead, create the
LinkedBlockingQueue with a capacity of dispatcherBatchSize + 1.
## In serviceStop(), stop the super class first and stop the internal threads
afterwards. Otherwise, events will keep being sent to the handler.
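To illustrate the two queue problems above, here is a minimal standalone sketch. It is not code from the patch: the class name QueueSemanticsSketch, the helper drainBatch, the String entities, and the batch size of 3 are all assumptions made for illustration. It uses only java.util.concurrent.LinkedBlockingQueue to show that offer silently rejects an entity when the queue is full, and that drainTo with a maxElements argument keeps a batch within the configured size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueSemanticsSketch {
    // Hypothetical stand-in for the configured batch size.
    static final int BATCH_SIZE = 3;

    // Drains at most batchSize entities, so one batch never exceeds the limit
    // even if the queue currently holds more.
    static List<String> drainBatch(BlockingQueue<String> queue, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        // Capacity of batch size + 1, as suggested in the comment above.
        BlockingQueue<String> entityQueue = new LinkedBlockingQueue<>(BATCH_SIZE + 1);
        for (int i = 0; i < BATCH_SIZE + 1; i++) {
            entityQueue.put("entity-" + i);
        }

        // The queue is full: offer returns false and the entity is dropped.
        boolean accepted = entityQueue.offer("entity-dropped");
        System.out.println("offer accepted on full queue: " + accepted);

        // A bounded drain takes BATCH_SIZE entities and leaves the rest queued.
        List<String> batch = drainBatch(entityQueue, BATCH_SIZE);
        System.out.println("batch: " + batch + ", left in queue: " + entityQueue.size());

        // put would block while the queue is full; space is available now,
        // so it returns immediately and nothing is lost.
        entityQueue.put("entity-kept");
        System.out.println("queue size after put: " + entityQueue.size());
    }
}
```

An unbounded drainTo(batch) would instead take everything currently queued, which is how a batch can grow past batch-size when the queue capacity is larger.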
> Improve RM system metrics publisher's performance by pushing events to
> timeline server in batch
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, timelineserver
> Affects Versions: 2.8.3
> Reporter: Hu Ziqian
> Assignee: Hu Ziqian
> Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch,
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch,
> YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the
> timeline server via a RESTful API. If the cluster load is heavy, many events
> are sent to the timeline server and its event handler thread becomes locked.
> YARN-7266 discusses this problem in detail. Because of the lock, the
> timeline server can't receive events as fast as they are generated in the RM,
> and many timeline events stay in the RM's memory. Eventually, those events
> consume all of the RM's memory, and the RM starts a full GC (which causes a
> JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or even
> hits an OOM.
> The main problem here is that the timeline server can't receive the RM's
> events as fast as they are generated. Currently, the RM system metrics
> publisher puts only one event in each request, and most of the time is spent
> handling HTTP headers or other network connection overhead on the timeline
> server side. Only a little time is spent on processing the timeline event
> itself, which is the truly valuable work.
> In this issue, we add a buffer to the system metrics publisher and let the
> publisher send events to the timeline server in batches, one batch per
> request. With the batch size set to 1000, in our experiment the rate at
> which the timeline server receives events improved 100x. We have implemented
> this function in our production environment, which accepts 20000 apps in one
> hour, and it works fine.
> We add the following configuration:
> * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of
> events the system metrics publisher sends in one request. The default value
> is 1000.
> * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the
> event buffer in the system metrics publisher.
> * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When
> batch publishing is enabled, we must prevent the publisher from waiting for
> a batch to fill up and holding events in the buffer for a long time. So we
> add another thread which sends the events in the buffer periodically. This
> config sets the interval of that periodic sending thread. The default value
> is 60s.
>
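The batching scheme in the issue description can be sketched roughly as follows. This is an illustrative simplification, not the code from the attached patches: the class BatchingPublisherSketch, its methods, the String events, and the Consumer standing in for one REST request to the timeline server are all hypothetical. It assumes a single flushing path instead of a dispatcher pool, and it uses a queue capacity of batch size + 1 with a blocking put, in line with the review comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class BatchingPublisherSketch {
    private final LinkedBlockingQueue<String> buffer;
    private final int batchSize;
    // Hypothetical stand-in for sending one batch in a single request.
    private final Consumer<List<String>> sender;
    private final ScheduledExecutorService flusher =
        Executors.newSingleThreadScheduledExecutor();

    public BatchingPublisherSketch(int batchSize, long intervalSeconds,
                                   Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.buffer = new LinkedBlockingQueue<>(batchSize + 1);
        this.sender = sender;
        // Periodic flush so a partially filled batch is not held indefinitely.
        flusher.scheduleAtFixedRate(this::flush, intervalSeconds,
            intervalSeconds, TimeUnit.SECONDS);
    }

    // Blocks when the buffer is full instead of dropping the event.
    public void publish(String event) {
        try {
            buffer.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while publishing", e);
        }
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends at most batchSize events in one call to the sender.
    public synchronized void flush() {
        List<String> batch = new ArrayList<>(batchSize);
        buffer.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sender.accept(batch);
        }
    }

    // Stops the periodic thread, then sends whatever is still buffered.
    public void stop() {
        flusher.shutdownNow();
        while (!buffer.isEmpty()) {
            flush();
        }
    }

    public static void main(String[] args) {
        BatchingPublisherSketch publisher = new BatchingPublisherSketch(
            3, 60, batch -> System.out.println("sent " + batch.size() + ": " + batch));
        for (int i = 0; i < 7; i++) {
            publisher.publish("event-" + i);
        }
        publisher.stop();
    }
}
```

With a batch size of 3 and seven published events, the sender is called with two full batches during publishing and one final batch of a single event from stop(). In a real service the caller would stop accepting new events before calling stop(), so that nothing is published after the final flush.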