[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751887#comment-16751887
 ] 

Wangda Tan commented on YARN-9195:
----------------------------------

Thanks [~ssy],  

Could u rename the patch to YARN-9175.001.patch? (According to 
[https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute]).  

And once you upload the patch, you can change the Jira to "Patch Available" so 
Jenkins will run UT. 

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: Shengyang Sha
>            Priority: Critical
>         Attachments: 
> cases_to_recreate_negative_pending_requests_scenario.diff, 
> patch.YARN-9195.diff
>
>
> Hi, all:
> Previously we have encountered a serious problem in ResourceManager, we found 
> that pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in hierarchical structure, the root 
> queue's pending containers became negative at last, thus the scheduling 
> process of the whole cluster became affected.
> The version of both our RM server and AMRM client in our application are 
> based on yarn 3.1, and we uses AMRMClientAsync#addSchedulingRequests() method 
> in our application to request resources from RM.
> After investigation, we found that the direct cause was numAllocations of 
> some AMs' requests became negative after RM failed over. And there are at 
> lease three necessary conditions:
> (1) Use schedulingRequests in AMRM client, and the application set zero to 
> the numAllocations for a schedulingRequest. In our batch job scenario, the 
> numAllocations of a schedulingRequest could turn to zero because 
> theoretically we can run a full batch job using only one container.
> (2) RM failovers.
> (3) Before AM reregisters itself to RM after RM restarts, RM has already 
> recovered some of the application's containers assigned before.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to AM once it 
> re-register itself through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will 
> removeFromOutstandingSchedulingRequests once AM gets 
> ContainersFromPreviousAttempts without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if it may not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> For better illustrating this case, I've written a test case based on the 
> latest hadoop trunk, posted in the attachment. You may try case 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to