[ https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776341#comment-16776341 ]
Weiwei Yang commented on YARN-9195:
-----------------------------------

Thanks [~ssy], [~Tao Yang]. Apologies for the delay in reviewing this, I need to catch up on a lot of changes. To better illustrate the problem, let's use an example.

1) Initial app state (container size is irrelevant)
{code:java}
total: 5, outstanding: 1, currentAttemptAllocated: 2, previousAttemptsAllocated: 2
{code}
2) RM fails over and containers are recovered; the current logic during register does
{code:java}
outstanding -= previousAttemptsAllocated
outstanding = -1 // WRONG!
{code}
Is this correct? If this is the case, I am confused why it even removes the previous attempts' containers in the first place.

Back to the patch: the method {{removePreviousContainersFromOutstandingSchedulingRequests}} scans all containers from previous attempts and, if a container's ID is not the same as the current attempt ID, removes it from the outstanding requests. Since {{prevContainers}} is the result of {{response.getContainersFromPreviousAttempts()}}, those containers' IDs should never match the current attempt ID, so why do you need to compare the IDs again? Thanks
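To make the arithmetic in the example above concrete, here is a minimal, self-contained sketch of the faulty accounting (plain local variables rather than the real AMRMClientImpl state, so the names are only illustrative): subtracting the recovered previous-attempt containers from an already-small outstanding count drives it below zero.

{code:java}
// Minimal illustration of the accounting problem in the example above.
// These are plain local variables, not the real AMRMClientImpl fields.
public class OutstandingAccountingSketch {
  public static void main(String[] args) {
    int total = 5;                      // containers the app wants overall
    int outstanding = 1;                // still being requested
    int currentAttemptAllocated = 2;    // allocated to the current attempt
    int previousAttemptsAllocated = 2;  // recovered from previous attempts

    // What the register path effectively does today: subtract every
    // recovered previous-attempt container from the outstanding count,
    // without checking whether it was already accounted for.
    outstanding -= previousAttemptsAllocated;

    System.out.println("total=" + total
        + ", currentAttemptAllocated=" + currentAttemptAllocated
        + ", outstanding=" + outstanding); // outstanding=-1, i.e. WRONG
  }
}
{code}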
> RM Queue's pending container number might get decreased unexpectedly or even
> become negative once RM failover
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 3.1.0
> Reporter: MalcolmSanders
> Assignee: MalcolmSanders
> Priority: Critical
> Attachments: YARN-9195.001.patch, YARN-9195.002.patch, YARN-9195.003.patch, cases_to_recreate_negative_pending_requests_scenario.diff
>
> Hi, all:
> Previously we encountered a serious problem in ResourceManager: we found that the pending container number of one RM queue became negative after RM failed over. Since queues in RM are managed in a hierarchical structure, the root queue's pending containers eventually became negative as well, so the scheduling process of the whole cluster was affected.
> Both our RM server and the AMRM client in our application are based on YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our application to request resources from RM.
> After investigation, we found that the direct cause was that the numAllocations of some AMs' requests became negative after RM failed over. There are at least three necessary conditions:
> (1) The application uses schedulingRequests in the AMRM client and sets the numAllocations of a schedulingRequest to zero. In our batch job scenario, the numAllocations of a schedulingRequest can drop to zero because, theoretically, we can run a full batch job using only one container.
> (2) RM fails over.
> (3) Before the AM re-registers itself to RM after the RM restart, RM has already recovered some of the containers previously assigned to the application.
> Here are some more details about the implementation:
> (1) After RM recovers, it sends all alive containers to the AM once the AM re-registers itself, through RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls removeFromOutstandingSchedulingRequests once the AM gets the ContainersFromPreviousAttempts, without checking whether these containers had been assigned before. As a consequence, its outstanding requests might be decreased unexpectedly even if they do not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> To better illustrate this case, I've written a test case based on the latest Hadoop trunk, posted in the attachment. You may try the cases testAMRMClientWithNegativePendingRequestsOnRMRestart and testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose filtering allocated containers before removeFromOutstandingSchedulingRequests in AMRMClientImpl during registerApplicationMaster; some sanity checks are also needed to prevent things from getting worse.
> More comments and suggestions are welcome.
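A rough sketch of the filtering and sanity check proposed in the description above, under simplified assumptions (plain container IDs instead of Container/SchedulingRequest objects, and a hypothetical helper name rather than the actual patch): only recovered containers that the client has not already accounted for are removed from the outstanding count, and the count is clamped so it can never go negative.

{code:java}
import java.util.List;
import java.util.Set;

// Hypothetical, simplified sketch; not the actual AMRMClientImpl change.
public final class OutstandingRequestFixSketch {

  /**
   * Removes only the recovered containers that have not been accounted for
   * yet from the outstanding count, and never lets the count drop below zero.
   */
  static int removeRecoveredFromOutstanding(int outstanding,
                                            List<Long> recoveredContainerIds,
                                            Set<Long> alreadyAllocatedIds) {
    for (long containerId : recoveredContainerIds) {
      if (alreadyAllocatedIds.contains(containerId)) {
        continue; // already counted when it was first allocated
      }
      alreadyAllocatedIds.add(containerId);
      outstanding--;
    }
    // Sanity check: pending/outstanding numbers must never become negative.
    return Math.max(outstanding, 0);
  }
}
{code}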