Tao Yang created YARN-9686:
------------------------------

             Summary: Reduce visibility of blacklisted nodes information (only 
for current app attempt) to avoid the abuse of memory
                 Key: YARN-9686
                 URL: https://issues.apache.org/jira/browse/YARN-9686
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Tao Yang
            Assignee: Tao Yang


Recently we hit an issue where the RM went through a long GC pause, and we 
found many WARN logs (Ignoring Blacklists, blacklist size 1775 is more than 
failure threshold ratio 0.20000000298023224 out of total usable nodes 1778) in 
the RM log at a very high frequency, about 30,000+ per second.
The direct cause is that a few apps with a large number of attempts and many 
blacklisted nodes were requested frequently via the REST API or web UI. For 
every single request, the RM has to allocate new memory for the blacklisted 
nodes many times (N * NUM_ATTEMPTS).
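
The multiplication described above can be illustrated with a minimal sketch. 
It is not the actual RM code: the class, the method and the request/attempt 
counts are hypothetical, it assumes one fresh copy of the blacklist per attempt 
per request (the N factor is left out), and the "more than threshold" check is 
only modelled on the wording of the WARN log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the cost, not real YARN classes.
class BlacklistCostSketch {
    // Assumed threshold ratio, taken from the WARN message (0.2 as a float).
    static final float FAILURE_THRESHOLD = 0.2f;

    // Returns {copies allocated, warnings logged} for the given load.
    static long[] simulate(int requests, int attempts,
                           List<String> blacklist, int usableNodes) {
        long copies = 0;
        long warnings = 0;
        for (int r = 0; r < requests; r++) {
            for (int a = 0; a < attempts; a++) {
                // Every attempt gets its own freshly allocated copy.
                List<String> copy = new ArrayList<>(blacklist);
                copies++;
                // Assumed check: an oversized blacklist is ignored with a WARN.
                if (copy.size() > FAILURE_THRESHOLD * usableNodes) {
                    warnings++;
                }
            }
        }
        return new long[] {copies, warnings};
    }

    public static void main(String[] args) {
        List<String> blacklist = new ArrayList<>();
        for (int i = 0; i < 1775; i++) {
            blacklist.add("node-" + i);
        }
        // 10 requests for an app with 100 attempts, 1778 usable nodes.
        long[] result = simulate(10, 100, blacklist, 1778);
        System.out.println("copies=" + result[0] + ", warnings=" + result[1]);
    }
}
```

Under these assumptions both counters grow as requests * attempts, which is 
consistent with GC pressure and WARN volume rising together.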

Currently both AM (system) blacklisted nodes and app blacklisted nodes are 
carried over across app attempts, and there is only one instance of each, so 
it is redundant and costly to traverse all blacklisted nodes for every app 
attempt. I therefore propose to get and show blacklisted nodes only for the 
current app attempt, to improve performance and avoid this kind of memory 
abuse in similar scenarios.
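
A rough sketch of the proposed behaviour, assuming hypothetical names 
(AttemptInfo, buildAttemptInfos) that do not correspond to the real RM 
classes: only the current attempt receives a copy of the shared blacklist, 
while older attempts get a shared empty list that costs no extra allocation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of "blacklist only for the current attempt".
class CurrentAttemptOnlySketch {
    // Hypothetical per-attempt view returned to REST or web UI callers.
    static final class AttemptInfo {
        final int attemptId;
        final List<String> blacklistedNodes;

        AttemptInfo(int attemptId, List<String> blacklistedNodes) {
            this.attemptId = attemptId;
            this.blacklistedNodes = blacklistedNodes;
        }
    }

    static List<AttemptInfo> buildAttemptInfos(int numAttempts,
                                               int currentAttemptId,
                                               Set<String> sharedBlacklist) {
        List<AttemptInfo> infos = new ArrayList<>(numAttempts);
        for (int id = 1; id <= numAttempts; id++) {
            List<String> nodes;
            if (id == currentAttemptId) {
                // Single copy per request, for the current attempt only.
                nodes = new ArrayList<>(sharedBlacklist);
            } else {
                // Older attempts share one immutable empty list.
                nodes = Collections.emptyList();
            }
            infos.add(new AttemptInfo(id, nodes));
        }
        return infos;
    }
}
```

With this shape, the number of blacklist copies per request stays at one no 
matter how many attempts the app has accumulated.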



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
