[
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776341#comment-16776341
]
Weiwei Yang commented on YARN-9195:
-----------------------------------
Thanks [~ssy], [~Tao Yang]. Apologies for the delay in reviewing this, I need
to catch up on a lot of recent changes.
To better illustrate the problem, let's use an example.
1) Initial app state (the container size is irrelevant here)
{code:java}
total 5, outstanding: 1, currentAttemptAllocated: 2, previousAttemptsAllocated:2
{code}
2) The RM fails over and recovers the containers.
The current logic during register is:
{code:java}
outstanding -= previousAttemptsAllocated
outstanding = 1 - 2 = -1 // WRONG!
{code}
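To spell out the arithmetic, here is a minimal, self-contained sketch (plain
Java for illustration, not actual AMRMClientImpl code; the variable names just
mirror the example above) of why the re-register subtraction double-counts the
previous attempts' containers:
{code:java}
public class OutstandingCounterSketch {
  public static void main(String[] args) {
    // State before failover, using the numbers from the example above.
    int total = 5;
    int currentAttemptAllocated = 2;
    int previousAttemptsAllocated = 2;
    // Outstanding already excludes everything that has ever been allocated.
    int outstanding = total - currentAttemptAllocated - previousAttemptsAllocated; // 1

    // The current register logic subtracts the previous attempts' containers
    // again, even though they were never added back after the failover.
    outstanding -= previousAttemptsAllocated;

    System.out.println("outstanding after re-register: " + outstanding); // prints -1
  }
}
{code}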
Is this correct? If so, I am confused about why it removes the previous
attempts' containers in the first place.
Back to the patch: the method
{{removePreviousContainersFromOutstandingSchedulingRequests}} scans all
containers from previous attempts and, if a container's ID does not belong to
the current attempt, removes it from the outstanding requests. Since
{{prevContainers}} is the result of
{{response.getContainersFromPreviousAttempts()}}, the containers' attempt IDs
should not be the same as the current attemptId anyway, so why do you need to
compare the IDs again?
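For reference, here is how I read that flow in code form (a hypothetical
sketch, not a copy of the patch; the helper {{inspect}} is made up for
illustration):
{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Container;

public class PreviousAttemptCheckSketch {
  // Hypothetical helper that only restates the review question above.
  static void inspect(RegisterApplicationMasterResponse response,
      ApplicationAttemptId currentAttemptId) {
    List<Container> prevContainers = response.getContainersFromPreviousAttempts();
    for (Container c : prevContainers) {
      ApplicationAttemptId owner = c.getId().getApplicationAttemptId();
      // The patch removes the container from the outstanding requests only when
      // owner differs from currentAttemptId. The question: can a container in
      // prevContainers ever carry the current attempt ID, i.e. is this
      // comparison ever needed?
      boolean fromEarlierAttempt = !owner.equals(currentAttemptId);
      System.out.println(c.getId() + " fromEarlierAttempt=" + fromEarlierAttempt);
    }
  }
}
{code}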
Thanks
> RM Queue's pending container number might get decreased unexpectedly or even
> become negative once RM fails over
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 3.1.0
> Reporter: MalcolmSanders
> Assignee: MalcolmSanders
> Priority: Critical
> Attachments: YARN-9195.001.patch, YARN-9195.002.patch,
> YARN-9195.003.patch, cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi, all:
> We recently encountered a serious problem in the ResourceManager: the pending
> container number of one RM queue became negative after the RM failed over.
> Since RM queues are managed in a hierarchical structure, the root queue's
> pending container count eventually became negative as well, which affected
> the scheduling process of the whole cluster.
> Both our RM server and the AMRM client in our application are based on YARN
> 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our
> application to request resources from the RM.
> After investigation, we found that the direct cause was that the
> numAllocations of some AMs' requests became negative after the RM failed
> over. There are at least three necessary conditions:
> (1) The application uses schedulingRequests in the AMRM client and sets the
> numAllocations of a schedulingRequest to zero. In our batch job scenario, the
> numAllocations of a schedulingRequest can drop to zero because, theoretically,
> we can run a full batch job using only one container.
> (2) The RM fails over.
> (3) Before the AM re-registers itself with the RM after the RM restarts, the
> RM has already recovered some of the containers previously assigned to the
> application.
> Here are some more details about the implementation:
> (1) After the RM recovers, it sends all alive containers to the AM once the AM
> re-registers itself, via
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls
> removeFromOutstandingSchedulingRequests as soon as the AM receives the
> containers from previous attempts, without checking whether these containers
> have already been accounted for. As a consequence, its outstanding requests
> may be decreased unexpectedly even when the count does not become negative.
> (3) There is no sanity check in the RM to validate requests from AMs.
> To better illustrate this case, I've written test cases based on the latest
> Hadoop trunk, posted in the attachment. You may try
> testAMRMClientWithNegativePendingRequestsOnRMRestart and
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose filtering already-allocated containers before
> calling removeFromOutstandingSchedulingRequests in AMRMClientImpl during
> registerApplicationMaster; some sanity checks are also needed to prevent
> things from getting worse (a rough sketch follows at the end of this
> description).
> More comments and suggestions are welcome.
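> As a rough illustration of that direction (not the actual patch; the names
> {{alreadyAccountedIds}} and {{safeDecrement}} below are placeholders made up
> for this sketch):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Set;
>
> import org.apache.hadoop.yarn.api.records.Container;
> import org.apache.hadoop.yarn.api.records.ContainerId;
>
> public class FilterPreviousContainersSketch {
>
>   // Keep only the containers the client has not already accounted for; only
>   // these should be deducted from the outstanding scheduling requests.
>   static List<Container> filterUnseen(List<Container> prevContainers,
>       Set<ContainerId> alreadyAccountedIds) {
>     List<Container> unseen = new ArrayList<>();
>     for (Container c : prevContainers) {
>       if (!alreadyAccountedIds.contains(c.getId())) {
>         unseen.add(c);
>       }
>     }
>     return unseen;
>   }
>
>   // Sanity check: never let a pending counter be driven below zero.
>   static int safeDecrement(int outstanding, int delta) {
>     return Math.max(0, outstanding - delta);
>   }
> }
> {code}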