Shengyang Sha created YARN-9195:
-----------------------------------
Summary: RM Queue's pending container number might get decreased
unexpectedly or even become negative once RM failover
Key: YARN-9195
URL: https://issues.apache.org/jira/browse/YARN-9195
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 3.1.0
Reporter: Shengyang Sha
Attachments: cases_to_recreate_negative_pending_requests_scenario.diff
Hi, all:
We previously encountered a serious problem in the ResourceManager: the pending
container count of one RM queue became negative after the RM failed over.
Because queues in the RM are managed in a hierarchical structure, the root
queue's pending container count eventually became negative as well, which
affected scheduling for the whole cluster.
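The propagation described above can be sketched with a toy queue tree. This is
only an illustration: the Queue class, its fields, and updatePending are
hypothetical names and do not correspond to the real RM scheduler classes. The
sketch assumes that a change in a leaf queue's pending count is applied to every
ancestor without validation.

```java
/** Toy queue tree; hypothetical, not the real RM scheduler implementation. */
public class QueuePendingSketch {

    static final class Queue {
        final String name;
        final Queue parent;        // null for the root queue
        int pendingContainers;

        Queue(String name, Queue parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Applies the delta to this queue and to every ancestor, unchecked. */
        void updatePending(int delta) {
            for (Queue q = this; q != null; q = q.parent) {
                q.pendingContainers += delta;
            }
        }
    }

    /** Returns the root's pending count after one leaf reports a negative ask. */
    static int rootPendingAfterNegativeAsk() {
        Queue root = new Queue("root", null);
        Queue healthy = new Queue("root.a", root);
        Queue faulty = new Queue("root.b", root);

        healthy.updatePending(1);   // a normal application waits for 1 container
        faulty.updatePending(-2);   // an AM's ask has gone negative
        return root.pendingContainers;
    }

    public static void main(String[] args) {
        System.out.println(rootPendingAfterNegativeAsk()); // prints -1
    }
}
```

In this model a single negative leaf is enough to drag the root below zero, even
though other queues still have real pending demand.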
Both our RM server and the YARN client in our application are based on YARN
3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our
application to request resources from the RM.
After investigation, we found that the direct cause was that numAllocations of
some AMs' requests became negative after the RM failed over. There are at least
three necessary conditions:
(1) The YARN client uses schedulingRequests, and the application sets
numAllocations to zero for a schedulingRequest. In our batch job scenario, the
numAllocations of a schedulingRequest can drop to zero because, in theory, we
can run a full batch job using only one container.
(2) The RM fails over.
(3) Before the AM re-registers itself with the RM after the RM restarts, the RM
has already recovered some of the application's previously assigned containers.
Here are some more details about the implementation:
(1) After the RM recovers, it sends all live containers to the AM once the AM
re-registers, through
RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
(2) During registerApplicationMaster, AMRMClientImpl calls
removeFromOutstandingSchedulingRequests once the AM gets
ContainersFromPreviousAttempts, without checking whether these containers have
been assigned before. As a consequence, its outstanding requests might be
decreased unexpectedly even in cases where they do not become negative.
(3) There is no sanity check in the RM to validate requests from AMs.
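The client-side bookkeeping in (2) can be modelled roughly as follows. This is a
simplified sketch, not the actual AMRMClientImpl code: the class
DoubleDecrementSketch and its removeFromOutstanding method are hypothetical
stand-ins, and the sketch assumes one outstanding count is reduced by one for
every container reported, whether or not it was counted earlier.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the unchecked decrement; hypothetical, not AMRMClientImpl. */
public class DoubleDecrementSketch {

    private int numAllocations;   // outstanding count for one scheduling request

    DoubleDecrementSketch(int requested) {
        this.numAllocations = requested;
    }

    /** Reduces the outstanding count for every reported container, unchecked. */
    void removeFromOutstanding(List<String> containerIds) {
        numAllocations -= containerIds.size();
    }

    /** Same container reported twice: on allocation and after re-registering. */
    static int simulate(int requested) {
        DoubleDecrementSketch client = new DoubleDecrementSketch(requested);
        List<String> containers = Arrays.asList("container_1");

        client.removeFromOutstanding(containers);  // normal allocation
        client.removeFromOutstanding(containers);  // reported again after failover
        return client.numAllocations;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1)); // prints -1: the count goes negative
        System.out.println(simulate(3)); // prints 1: silently one lower than it should be
    }
}
```

The second call shows the quieter variant of the problem: the count stays
positive but is still lower than what the application actually needs.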
To better illustrate this case, I've written test cases based on the latest
Hadoop trunk and posted them in the attachment. You may try the cases
testAMRMClientWithNegativePendingRequestsOnRMRestart and
testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
To solve this issue, I propose filtering out already-allocated containers before
calling removeFromOutstandingSchedulingRequests in AMRMClientImpl during
registerApplicationMaster. Some sanity checks are also needed to prevent
things from getting worse.
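One possible shape of this proposal, as a rough sketch only: the class
FilteredRemovalSketch, the knownContainerIds set, and validateNumAllocations are
all hypothetical names chosen for illustration, and the actual patch may look
quite different. The sketch assumes the client can remember which container IDs
it has already counted, and that a negative numAllocations can simply be
rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the proposed filtering and sanity check; all names hypothetical. */
public class FilteredRemovalSketch {

    private final Set<String> knownContainerIds = new HashSet<>();
    private int numAllocations;

    FilteredRemovalSketch(int requested) {
        this.numAllocations = requested;
    }

    /** Decrements only for containers not counted before, never below zero. */
    void onContainersReported(List<String> containerIds) {
        for (String id : containerIds) {
            if (!knownContainerIds.add(id)) {
                continue;               // already counted, skip it
            }
            if (numAllocations > 0) {
                numAllocations--;
            }
        }
    }

    /** Hypothetical server-side check that rejects negative requests. */
    static int validateNumAllocations(int requested) {
        if (requested < 0) {
            throw new IllegalArgumentException(
                "numAllocations must not be negative: " + requested);
        }
        return requested;
    }

    /** Same container reported on allocation and again after re-registering. */
    static int simulate(int requested) {
        FilteredRemovalSketch client = new FilteredRemovalSketch(requested);
        List<String> containers = Arrays.asList("container_1");

        client.onContainersReported(containers);  // normal allocation
        client.onContainersReported(containers);  // duplicate report is filtered
        return client.numAllocations;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1)); // prints 0
        System.out.println(simulate(3)); // prints 2
    }
}
```

Filtering on the client addresses the root cause, while the validation step is a
second line of defence in case a negative value is sent for any other reason.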
More comments and suggestions are welcome.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)