[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shengyang Sha updated YARN-9195:
--------------------------------
    Description: 
Hi, all:

We recently encountered a serious problem in ResourceManager: the pending 
container number of one RM queue became negative after RM failed over. Since 
queues in RM are managed in a hierarchical structure, the root queue's pending 
container count eventually became negative as well, which affected scheduling 
across the whole cluster.

Both our RM server and the YARN client used by our application are based on 
YARN 3.1, and our application uses the AMRMClientAsync#addSchedulingRequests() 
method to request resources from RM.

After investigation, we found that the direct cause was that the 
numAllocations of some AMs' requests became negative after RM failed over. 
There are at least three necessary conditions:
(1) The application uses scheduling requests in the YARN client and sets the 
numAllocations of a SchedulingRequest to zero. In our batch job scenario, the 
numAllocations of a SchedulingRequest can drop to zero because, in theory, a 
full batch job can be run with only one container.
(2) RM fails over.
(3) Before the AM re-registers itself with RM after the RM restart, RM has 
already recovered some of the containers previously assigned to the 
application.

Here are some more details about the implementation:
(1) After RM recovers, it sends all alive containers to the AM once the AM 
re-registers itself, via 
RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
(2) During registerApplicationMaster, AMRMClientImpl calls 
removeFromOutstandingSchedulingRequests as soon as the AM receives the 
containers from previous attempts, without checking whether these containers 
had already been accounted for. As a consequence, its outstanding requests may 
be decreased unexpectedly, even in cases where the count does not actually go 
negative (see the sketch after this list).
(3) There is no sanity check in RM to validate requests from AMs.
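
To make the failure mode concrete, here is a simplified sketch of the 
problematic bookkeeping on the client side (this is not the actual 
AMRMClientImpl code; findMatchingOutstandingRequest is a hypothetical helper 
standing in for the real matching logic):

// Illustrative sketch only -- not the real AMRMClientImpl code.
// On re-registration, every container reported via
// RegisterApplicationMasterResponse#getContainersFromPreviousAttempts()
// decrements the matching outstanding SchedulingRequest, regardless of
// whether the AM had already accounted for that container before failover.
for (Container recovered : response.getContainersFromPreviousAttempts()) {
  SchedulingRequest outstanding = findMatchingOutstandingRequest(recovered);
  if (outstanding != null) {
    ResourceSizing sizing = outstanding.getResourceSizing();
    // No lower bound here: a request whose numAllocations is already 0
    // (condition (1) above) ends up at -1.
    sizing.setNumAllocations(sizing.getNumAllocations() - 1);
  }
}
// The negative numAllocations is later sent to RM in allocate(), and since
// RM does not sanity-check it (point (3) above), the queue's pending
// container count is driven negative.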

To better illustrate this case, I've written test cases based on the latest 
hadoop trunk and posted them in the attachment. You may try 
testAMRMClientWithNegativePendingRequestsOnRMRestart and 
testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.

To solve this issue, I propose filtering out already-known containers before 
calling removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
registerApplicationMaster; sanity checks are also needed to prevent things 
from getting worse.
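
Roughly, the filtering could look like the sketch below (illustrative names 
only, assuming the client keeps a set of container ids it has already 
accounted for; the actual patch may differ):

// Sketch of the proposed direction -- not the final patch.
// Only containers the client has not accounted for yet should be passed on
// to removeFromOutstandingSchedulingRequests().
private List<Container> filterAlreadyAccountedContainers(
    List<Container> containersFromPreviousAttempts,
    Set<ContainerId> alreadyAccountedContainerIds) { // hypothetical bookkeeping set
  List<Container> newlyReported = new ArrayList<>();
  for (Container container : containersFromPreviousAttempts) {
    if (!alreadyAccountedContainerIds.contains(container.getId())) {
      newlyReported.add(container);
    }
  }
  return newlyReported;
}
// In addition, the decrement should be clamped so that numAllocations can
// never be pushed below zero, e.g.:
//   sizing.setNumAllocations(Math.max(0, sizing.getNumAllocations() - 1));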

Comments and suggestions are welcome.



> RM queue's pending container number might get decreased unexpectedly or even 
> become negative once RM fails over
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: Shengyang Sha
>            Priority: Critical
>         Attachments: cases_to_recreate_negative_pending_requests_scenario.diff
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
