[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776516#comment-16776516
 ] 

MalcolmSanders commented on YARN-9195:
--------------------------------------

Thanks [~cheersyang] for taking the effort to review this patch. In fact it's a 
really complicated case that involves several implementation details. To 
illustrate it, I uploaded a patch called 
cases_to_recreate_negative_pending_requests_scenario.diff in the attachments 
which contains only the test case. I hope it helps you understand the situation 
more clearly.

As to the example, after RM failover, outstanding may be decreased to zero but 
not below zero. That is purely an artifact of the implementation: after each 
decrement the code only checks whether the value equals zero. But if outstanding 
is already zero before RM failover, it may turn negative after RM failover, 
because after the decrement the value is negative, so it never matches the 
equals-zero check and is never cleaned up.
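A minimal, self-contained sketch of that decrement-then-check pattern (illustrative 
only, not the actual AMRMClientImpl code; the variable names are assumptions):

{code:java}
// Illustrative sketch only, not the actual AMRMClientImpl code.
public class OutstandingDecrementSketch {
  public static void main(String[] args) {
    // numAllocations of one outstanding SchedulingRequest
    int outstanding = 0;                    // already zero before RM failover
    int recoveredContainers = 1;            // same container reported again after failover

    for (int i = 0; i < recoveredContainers; i++) {
      outstanding--;                        // decrement first ...
      if (outstanding == 0) {               // ... then only the exact-zero case is handled
        System.out.println("request satisfied, removed from outstanding set");
      }
    }
    // outstanding is now -1; the equals-zero check above never fires for negative values
    System.out.println("outstanding = " + outstanding);
  }
}
{code}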

{quote}
is this correct? If this is the case, I am confused why it even removes the 
previous attempts containers at first place? 
{quote}
I think outstanding should still be decreased by containers from previous 
attempts, because newly-known containers from a previous attempt may arrive. For 
example, consider this case:
(1) AM, RM and one of the NMs fail at the same time, and that NM holds one 
container of this AM.
(2) RM fails over successfully and relaunches a new attempt of the AM, while the 
NM has not restarted yet.
(3) After some time, RM fails over again and that NM also starts successfully. 
While re-registering with RM, the AM will receive a newly-known container from 
the previous attempt, and that container can still be used. So in this situation, 
outstanding should be decreased by containers from previous attempts.

{quote}
Back to the patch, method 
removePreviousContainersFromOutstandingSchedulingRequests, it scans all 
containers from previous attempts, if container ID is not same as current ID, 
remove it from outstanding request. As prevContainers is the result of 
response.getContainersFromPreviousAttempts() which means the containers' ID 
should not be same as current attemptId, why you need to compare the ID again?
{quote}

In fact, the method name might be misleading. Once the AM calls 
registerApplicationMaster, AbstractYarnScheduler#getTransferredContainers() in 
RM will return all the alive containers (except the AM container) as containers 
from previous attempts. Notice that in case of RM failover, RM still returns all 
the alive containers when the AM registers itself. For an unmanaged AM it is 
necessary for RM to return all the alive containers, because RM is not 
responsible for starting its AM container. But the other types of AM don't need 
to be given the containers they already know about when RM fails over, and doing 
so decreases the outstanding requests in AMRMClient, which is unexpected. That 
is why the container's attempt ID still has to be compared against the current 
attempt ID.
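To make that concrete, here is a minimal sketch of the attempt-ID check. It is 
illustrative only (class and method names are my own assumptions), not the exact 
code in removePreviousContainersFromOutstandingSchedulingRequests:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Container;

// Illustrative sketch: after RM failover, the "previous attempts" list also
// contains this attempt's own, already-known containers, so only containers
// whose attempt ID differs from the current one should reduce outstanding
// scheduling requests.
class PreviousAttemptFilterSketch {

  static List<Container> containersToDeduct(List<Container> prevContainers,
                                            ApplicationAttemptId currentAttemptId) {
    List<Container> toDeduct = new ArrayList<>();
    for (Container c : prevContainers) {
      if (!c.getId().getApplicationAttemptId().equals(currentAttemptId)) {
        toDeduct.add(c);
      }
    }
    return toDeduct;
  }
}
{code}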

Previously I came up with an alternative idea: return a different result from 
AbstractYarnScheduler#getTransferredContainers() depending on whether the AM is 
an unmanaged AM (a rough sketch follows the pros and cons below).

Pros:
We could prevent this case just by updating RM with this bugfix, without 
requiring applications to update their YARN client jars.

Cons:
It would make the semantics of 
RegisterApplicationMasterResponse#getContainersFromPreviousAttempts() ambiguous 
between unmanaged AMs and the other types.
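A rough sketch of that RM-side alternative (fetchAliveContainers() and 
isUnmanagedAM() are hypothetical helpers standing in for the real scheduler/RMApp 
lookups; this is not the actual AbstractYarnScheduler code):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Container;

// Illustrative sketch of returning different results for unmanaged and managed AMs.
abstract class TransferredContainersSketch {

  List<Container> getTransferredContainers(ApplicationAttemptId currentAttempt) {
    List<Container> alive = fetchAliveContainers(currentAttempt);
    if (isUnmanagedAM(currentAttempt.getApplicationId())) {
      // Unmanaged AM: RM never launched its containers, so it needs the full list.
      return alive;
    }
    // Managed AM: after RM failover it already knows its current attempt's
    // containers, so only hand back containers from genuinely previous attempts.
    List<Container> result = new ArrayList<>();
    for (Container c : alive) {
      if (!c.getId().getApplicationAttemptId().equals(currentAttempt)) {
        result.add(c);
      }
    }
    return result;
  }

  // Hypothetical helpers, not real AbstractYarnScheduler methods.
  abstract boolean isUnmanagedAM(ApplicationId appId);

  abstract List<Container> fetchAliveContainers(ApplicationAttemptId attempt);
}
{code}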

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: MalcolmSanders
>            Assignee: MalcolmSanders
>            Priority: Critical
>         Attachments: YARN-9195.001.patch, YARN-9195.002.patch, 
> YARN-9195.003.patch, cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi, all:
> Previously we encountered a serious problem in ResourceManager: we found that 
> the pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in a hierarchical structure, the root 
> queue's pending containers eventually became negative as well, and thus the 
> scheduling of the whole cluster was affected.
> Both our RM server and the AMRM client in our application are based on YARN 
> 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our 
> application to request resources from RM.
> After investigation, we found that the direct cause was that numAllocations of 
> some AMs' requests became negative after RM failed over. And there are at 
> least three necessary conditions:
> (1) The application uses schedulingRequests in the AMRM client and sets 
> numAllocations of a SchedulingRequest to zero. In our batch job scenario, the 
> numAllocations of a SchedulingRequest can drop to zero because theoretically 
> we can run a full batch job using only one container.
> (2) RM fails over.
> (3) Before the AM re-registers itself with RM after RM restarts, RM has 
> already recovered some of the application's previously assigned containers.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to the AM once it 
> re-registers itself, through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will call 
> removeFromOutstandingSchedulingRequests once the AM gets 
> ContainersFromPreviousAttempts, without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly, even if they do not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> To better illustrate this case, I've written test cases based on the latest 
> hadoop trunk, posted in the attachment. You may try the cases 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.


