Tao Yang created YARN-9686:
------------------------------
Summary: Reduce visibility of blacklisted nodes information (only
for current app attempt) to avoid the abuse of memory
Key: YARN-9686
URL: https://issues.apache.org/jira/browse/YARN-9686
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang
Recently we found that the RM went through a long GC pause, and that the RM
log contained many WARN messages (Ignoring Blacklists, blacklist size 1775 is
more than failure threshold ratio 0.20000000298023224 out of total usable
nodes 1778) at a very high frequency, more than 30,000 per second.
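
To make the WARN message concrete, here is a minimal sketch of the kind of threshold check that would produce it. The class and method names are hypothetical and the strict "less than" comparison is an assumption; this is not the actual RM code. It also shows that the odd-looking ratio in the log is what a float 0.2 looks like once widened to double.

```java
// Hypothetical sketch; names and the exact comparison are assumptions,
// not the actual ResourceManager implementation.
final class BlacklistThresholdSketch {

    /** Returns true when the blacklist is small enough to be honored. */
    static boolean isBlacklistHonored(int blacklistSize, float thresholdRatio, int usableNodes) {
        // The float ratio is widened to double before the multiplication.
        double failureThreshold = (double) thresholdRatio * usableNodes;
        return blacklistSize < failureThreshold;
    }

    public static void main(String[] args) {
        float ratio = 0.2f;
        // Widening 0.2f to double exposes the float rounding error.
        System.out.println((double) ratio);                         // prints 0.20000000298023224
        // 0.2 * 1778 is about 355.6, far below 1775, so the blacklist is ignored.
        System.out.println(isBlacklistHonored(1775, ratio, 1778));  // prints false
    }
}
```

With the numbers from the log, the threshold is roughly 355.6 nodes, so a blacklist of 1775 nodes is ignored and a WARN line is logged each time the check runs.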
The direct cause is that a few apps with a large number of attempts and many
blacklisted nodes were requested frequently via the REST API or the web UI.
For every single request, the RM has to allocate new memory for the
blacklisted nodes many times (N * NUM_ATTEMPTS).
Currently both AM (system) blacklisted nodes and app blacklisted nodes are
transferred among app attempts, and there is only one instance of each.
It is redundant and costly to traverse all blacklisted nodes for every app
attempt, so I propose to get and show blacklisted nodes only for the current
app attempt, to improve performance and avoid excessive memory use in
similar scenarios.
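
The difference in cost can be sketched as follows. This is a simplified model under stated assumptions, not the RM code: SharedBlacklist, reportAllAttempts and reportCurrentAttempt are hypothetical names, and the attempt count of 50 is made up for illustration. It only counts how many copies of the shared blacklist are allocated per request.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of building an app report; not the actual RM classes.
final class BlacklistReportSketch {

    /** Stand-in for the single blacklist instance shared by all attempts of an app. */
    static final class SharedBlacklist {
        private final Set<String> nodes = new HashSet<>();
        int copies = 0; // number of defensive copies allocated so far

        void add(String node) {
            nodes.add(node);
        }

        Set<String> snapshot() {
            copies++;
            return new HashSet<>(nodes); // allocates memory proportional to N
        }
    }

    /** Behavior described in the issue: every attempt copies the same shared set. */
    static List<Set<String>> reportAllAttempts(SharedBlacklist shared, int numAttempts) {
        List<Set<String>> perAttempt = new ArrayList<>();
        for (int i = 0; i < numAttempts; i++) {
            perAttempt.add(shared.snapshot());
        }
        return perAttempt;
    }

    /** Proposed behavior: copy once, for the current attempt only. */
    static Set<String> reportCurrentAttempt(SharedBlacklist shared) {
        return shared.snapshot();
    }

    public static void main(String[] args) {
        SharedBlacklist shared = new SharedBlacklist();
        for (int i = 0; i < 1775; i++) {
            shared.add("node-" + i);
        }

        reportAllAttempts(shared, 50); // 50 attempts is an illustrative value
        System.out.println("copies for all attempts: " + shared.copies);

        shared.copies = 0;
        reportCurrentAttempt(shared);
        System.out.println("copies for current attempt: " + shared.copies);
    }
}
```

In this model the per-request allocation drops from N * NUM_ATTEMPTS entries to N entries, and the reported nodes are the same because all attempts share one instance.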