[
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhankun Tang updated YARN-8234:
-------------------------------
Target Version/s: 3.1.4 (was: 3.1.3)
Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to
3.1.4, please move back if it is a blocker.
> Improve RM system metrics publisher's performance by pushing events to
> timeline server in batch
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, timelineserver
> Affects Versions: 2.8.3
> Reporter: Hu Ziqian
> Assignee: Hu Ziqian
> Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch,
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch,
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch,
> YARN-8234.003.patch, YARN-8234.004.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the
> timeline server via its RESTful API. Under heavy cluster load, many events
> are sent to the timeline server and the timeline server's event handler
> thread blocks on a lock. YARN-7266 describes this problem in detail. Because
> of the lock, the timeline server can't receive events as fast as the RM
> generates them, and many timeline events pile up in the RM's memory.
> Eventually those events consume all of the RM's memory, and the RM either
> starts a full GC (which causes a JVM stop-the-world pause and a timeout
> between the RM and ZooKeeper) or hits an OOM.
> The main problem is that the timeline server can't receive the RM's events
> as fast as they are generated. Currently the RM system metrics publisher puts
> only one event in each request, so most of the time is spent handling HTTP
> headers and other connection overhead on the timeline server side. Only a
> small fraction of the time is spent processing the timeline event itself,
> which is the valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the
> publisher send events to the timeline server in batches, one batch per
> request. With the batch size set to 1000, our experiment showed a 100x
> improvement in the rate at which the timeline server receives events. We have
> deployed this feature in our production environment, which accepts 20000
> apps in one hour, and it works fine.
> We add the following configuration:
> * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of
> events the system metrics publisher sends in one request. The default value
> is 1000.
> * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the
> event buffer in the system metrics publisher.
> * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When
> batch publishing is enabled, the publisher must not hold events in the
> buffer for a long time while waiting for a batch to fill up. So we add
> another thread that periodically sends the events in the buffer. This config
> sets the interval of that periodic sending thread. The default value is 60s.
>
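The buffering scheme described above can be sketched roughly as follows. This
is only an illustration and not the code from the attached patches: the class
name BatchingPublisher, its methods, and the Consumer that stands in for one
REST request to the timeline server are all hypothetical. The three
constructor arguments are assumed to correspond to the batch-size,
buffer-size and interval-seconds settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a batching publisher; not the YARN-8234 patch. */
class BatchingPublisher<E> implements AutoCloseable {
    private final LinkedBlockingQueue<E> buffer;   // bounded by "buffer-size"
    private final int batchSize;                   // "batch-size"
    private final Consumer<List<E>> sender;        // stands in for one REST request
    private final ScheduledExecutorService timer;

    BatchingPublisher(int batchSize, int bufferSize, long intervalSeconds,
                      Consumer<List<E>> sender) {
        this.buffer = new LinkedBlockingQueue<>(bufferSize);
        this.batchSize = batchSize;
        this.sender = sender;
        this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "batch-flush-timer");
            t.setDaemon(true);
            return t;
        });
        // Periodic flush ("interval-seconds") so that a partially filled
        // batch is not held in the buffer indefinitely.
        timer.scheduleWithFixedDelay(this::flush, intervalSeconds,
                intervalSeconds, TimeUnit.SECONDS);
    }

    /** Queues an event; returns false if the buffer is full and it was dropped. */
    boolean publish(E event) {
        boolean accepted = buffer.offer(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
        return accepted;
    }

    /** Drains the buffer, handing at most batchSize events to each request. */
    synchronized void flush() {
        List<E> batch = new ArrayList<>(batchSize);
        while (buffer.drainTo(batch, batchSize) > 0) {
            sender.accept(batch);
            batch = new ArrayList<>(batchSize);
        }
    }

    /** Stops the timer and sends whatever is still buffered. */
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
    }
}
```

In this sketch a full buffer causes publish() to reject the event rather than
block the caller; whether the real publisher drops, blocks or retries in that
case is not stated in the description, so treat that choice as an assumption.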
--
This message was sent by Atlassian Jira
(v8.3.2#803003)