Tao Yang created YARN-9423:
------------------------------
Summary: Optimize AM launcher to avoid bottleneck when a large
number of AM failover happen at the same time
Key: YARN-9423
URL: https://issues.apache.org/jira/browse/YARN-9423
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 3.2.0
Reporter: Tao Yang
Assignee: Tao Yang
We have seen slow application recovery when many NMs are lost at the
same time:
# Many NMs shut down abnormally at the same time.
# The NMs expire, and a large number of AMs start to fail over.
# AM containers are allocated but not launched for about half an hour.
During this slow recovery, all ApplicationMasterLauncher threads were calling
cleanup for containers on the lost nodes and kept retrying to communicate
with the NMs for 3 minutes (the retry policy is configured in
NMProxy#createNMProxy), even though the RM already knew these NMs were lost and
probably could not be reached for a long time. Meanwhile, many AM cleanup and
launch operations were still waiting in the queue
(ApplicationMasterLauncher#masterEvents). AM launch operations were therefore
blocked behind cleanup operations, each of which wasted 3 minutes. As a
result, AM failover can be very slow.
I think we can optimize the AM launcher in two ways:
# Change the type of ApplicationMasterLauncher#masterEvents from
LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are
executed ahead of cleanup operations.
# In the cleanup process (AMLauncher#cleanup), check the node state first and
skip cleanup of AM containers on non-existent or unusable NMs before
communicating with the NM (because these NMs probably cannot be reached for a
long time).
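
The two ideas above could be sketched roughly as follows. This is only a standalone illustration, not the actual patch: AmEvent, EventType, process and the set of usable node IDs are hypothetical stand-ins for the entries queued in ApplicationMasterLauncher#masterEvents and for the RM's view of node state. Since PriorityBlockingQueue does not preserve insertion order among equal-priority elements, the sketch adds a sequence number to keep FIFO order within each event type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class AmLauncherSketch {
    /** Hypothetical event types; declaration order doubles as priority (LAUNCH first). */
    enum EventType { LAUNCH, CLEANUP }

    /** Hypothetical stand-in for an entry queued in masterEvents. */
    static final class AmEvent implements Comparable<AmEvent> {
        private static final AtomicLong SEQ = new AtomicLong();
        final EventType type;
        final String nodeId;
        // Tie-breaker: keeps FIFO order among events of the same type.
        final long seq = SEQ.getAndIncrement();

        AmEvent(EventType type, String nodeId) {
            this.type = type;
            this.nodeId = nodeId;
        }

        @Override
        public int compareTo(AmEvent other) {
            int byType = type.compareTo(other.type);
            return byType != 0 ? byType : Long.compare(seq, other.seq);
        }
    }

    /**
     * Drains the queue in priority order and returns a log of what was done.
     * Cleanup events for nodes that are not in usableNodes are skipped
     * instead of being sent to the (unreachable) NM.
     */
    static List<String> process(PriorityBlockingQueue<AmEvent> queue,
                                Set<String> usableNodes) {
        List<String> log = new ArrayList<>();
        AmEvent e;
        while ((e = queue.poll()) != null) {
            if (e.type == EventType.CLEANUP && !usableNodes.contains(e.nodeId)) {
                log.add("SKIP_CLEANUP " + e.nodeId);
                continue;
            }
            log.add(e.type + " " + e.nodeId);
        }
        return log;
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<AmEvent> queue = new PriorityBlockingQueue<>();
        // Cleanups arrive first, as in the incident described above.
        queue.add(new AmEvent(EventType.CLEANUP, "nm-lost-1"));
        queue.add(new AmEvent(EventType.CLEANUP, "nm-ok-1"));
        queue.add(new AmEvent(EventType.LAUNCH, "nm-ok-2"));
        queue.add(new AmEvent(EventType.LAUNCH, "nm-ok-1"));

        Set<String> usable = new HashSet<>(Arrays.asList("nm-ok-1", "nm-ok-2"));
        for (String line : process(queue, usable)) {
            System.out.println(line);
        }
    }
}
```

In this sketch both launches are handled before any cleanup, and the cleanup for the lost node is skipped without any NM communication. A real change would also have to decide whether skipped cleanups need to be retried if the node comes back.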