[
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770479#comment-16770479
]
Weiwei Yang commented on YARN-9195:
-----------------------------------
Sure [~leftnoteasy], I'll help to review this. Sorry for the late response, I
was occupied with something else last week.
Thanks [~ssy] for filing the issue. I just read the patch and am trying to
understand {{refreshContainersFromPreviousAttempts()}}: if a container from a
previous attempt is completed, you are not removing it from the outstanding
requests. Why is that?
I am also not sure why you need {{initApplicationAttempt()}}; it retrieves the
current app attempt id from the AM RM token. Since the protocol already has
{{getContainersFromPreviousAttempts()}}, what is the attempt id used for here?
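For context, the flow the question refers to looks roughly like this (a
minimal sketch against the AMRMClient API; the host/port values and class name
are placeholders):
{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RegisterFlowSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> amRmClient = AMRMClient.createAMRMClient();
    amRmClient.init(new YarnConfiguration());
    amRmClient.start();

    // The register response already carries the containers RM recovered from
    // previous attempts, without the AM needing to know its own attempt id.
    RegisterApplicationMasterResponse response =
        amRmClient.registerApplicationMaster("am-host", 0, "");
    List<Container> recovered = response.getContainersFromPreviousAttempts();
    System.out.println("Recovered containers: " + recovered.size());
  }
}
{code}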
Thanks
> RM Queue's pending container number might get decreased unexpectedly or even
> become negative once RM failover
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 3.1.0
> Reporter: Shengyang Sha
> Assignee: Shengyang Sha
> Priority: Critical
> Attachments: YARN-9195.001.patch, YARN-9195.002.patch,
> cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi, all:
> We previously encountered a serious problem in ResourceManager: the pending
> container number of one RM queue became negative after RM failed over. Since
> queues in RM are managed in a hierarchical structure, the root queue's
> pending containers eventually became negative as well, and the scheduling of
> the whole cluster was affected.
> Both our RM server and the AMRM client in our application are based on YARN
> 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our
> application to request resources from RM.
> After investigation, we found that the direct cause was that the
> numAllocations of some AMs' requests became negative after RM failed over.
> There are at least three necessary conditions:
> (1) The application uses schedulingRequests in the AMRM client and sets the
> numAllocations of a schedulingRequest to zero. In our batch job scenario, the
> numAllocations of a schedulingRequest can drop to zero because, in theory, a
> full batch job can run with only one container (see the sketch after this
> list).
> (2) RM fails over.
> (3) Before the AM re-registers itself with RM after the RM restart, RM has
> already recovered some of the containers previously assigned to the
> application.
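> To make condition (1) concrete, here is a minimal sketch of such a request
> (this assumes the SchedulingRequest builder API; the tag, request id, and
> resource sizes are made up):
> {code:java}
> import java.util.Collections;
> import org.apache.hadoop.yarn.api.records.ExecutionType;
> import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.api.records.ResourceSizing;
> import org.apache.hadoop.yarn.api.records.SchedulingRequest;
>
> public class ZeroAllocationRequestSketch {
>   // Builds a schedulingRequest whose numAllocations has already dropped to 0,
>   // as in the batch-job scenario above. It would be submitted through
>   // AMRMClientAsync#addSchedulingRequests().
>   static SchedulingRequest zeroAllocationRequest() {
>     return SchedulingRequest.newBuilder()
>         .allocationRequestId(1L)
>         .priority(Priority.newInstance(0))
>         .executionType(ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED))
>         .allocationTags(Collections.singleton("batch-worker")) // illustrative tag
>         .resourceSizing(ResourceSizing.newInstance(
>             0, Resource.newInstance(2048, 1))) // numAllocations == 0
>         .build();
>   }
> }
> {code}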
> Here are some more details about the implementation:
> (1) After RM recovers, it sends all alive containers to the AM once the AM
> re-registers itself, through
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls
> removeFromOutstandingSchedulingRequests once the AM gets the
> ContainersFromPreviousAttempts, without checking whether these containers had
> been assigned before. As a consequence, the outstanding requests might be
> decreased unexpectedly even if they do not become negative (see the toy
> snippet after this list).
> (3) There is no sanity check in RM to validate requests from AMs.
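> The following toy snippet (not the actual AMRMClientImpl code, just an
> illustration of point (2)) shows how subtracting recovered containers from
> the outstanding numAllocations without any check can drive the count below
> zero:
> {code:java}
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> public class NegativePendingIllustration {
>   public static void main(String[] args) {
>     // Outstanding numAllocations per allocationRequestId, as a client might track it.
>     Map<Long, Integer> outstanding = new HashMap<>();
>     outstanding.put(1L, 0); // already reduced to 0, see condition (1)
>
>     // After failover, RM hands back two recovered containers for request id 1
>     // via getContainersFromPreviousAttempts(), see condition (3).
>     List<Long> recoveredRequestIds = Arrays.asList(1L, 1L);
>
>     // Point (2): decrement without checking whether these containers were
>     // already assigned before, and without a floor at zero.
>     for (long id : recoveredRequestIds) {
>       outstanding.merge(id, -1, Integer::sum);
>     }
>     System.out.println(outstanding); // {1=-2}: a negative pending count
>   }
> }
> {code}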
> To better illustrate this case, I've written test cases based on the latest
> hadoop trunk and posted them in the attachment. You may try
> testAMRMClientWithNegativePendingRequestsOnRMRestart and
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose to filter out already-allocated containers
> before removeFromOutstandingSchedulingRequests in AMRMClientImpl during
> registerApplicationMaster; some sanity checks are also needed to prevent
> things from getting worse.
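> As a rough sketch of that proposal (not the attached patch, just the shape
> of the check), the client would skip containers it has already accounted for
> and never let a pending count drop below zero:
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import org.apache.hadoop.yarn.api.records.Container;
> import org.apache.hadoop.yarn.api.records.ContainerId;
>
> public class FilterRecoveredContainersSketch {
>   // Decrements the outstanding numAllocations for a recovered container only
>   // if it has not been accounted for yet, and never below zero.
>   static void removeIfUnseen(Container recovered,
>                              Set<ContainerId> alreadyAccounted,
>                              Map<Long, Integer> outstanding) {
>     if (!alreadyAccounted.add(recovered.getId())) {
>       return; // container was already counted against an earlier allocation
>     }
>     outstanding.computeIfPresent(recovered.getAllocationRequestId(),
>         (id, pending) -> Math.max(0, pending - 1)); // sanity check: floor at 0
>   }
> }
> {code}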
> More comments and suggestions are welcome.