[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470144#comment-16470144 ]
Weiwei Yang commented on YARN-7003:
-----------------------------------

Hi [~Tao Yang]

The current approach to recovering queue state is not optimal, which causes state mismatch problems like this one. When the queue state in the conf is updated to {{STOPPED}}, the RM internally changes it to {{DRAINING}} without persisting this back to the conf, so the in-memory state becomes inconsistent with the persisted state. There are a few options to fix this:

# Let the RM update the queue state in the conf before it updates the in-memory state. Letting CS update its own conf file is not good practice.
# Remove the queue state from the conf and let the RM persist the queue state in its state store all by itself. This needs more changes but should be the most consistent way. However, it is an incompatible change that would cause more problems for users.
# As long as apps are being added to a queue whose configured state is {{STOPPED}}, reset the state to {{DRAINING}} during recovery. This is essentially the reverse of what CS does today for the {{STOPPED}} state.

Given that #1 is not optimal and #2 is too risky, I prefer #3. This is similar to the patch [~Tao Yang] uploaded. A few comments:

# In CapacityScheduler, move the queue state recovery logic out of the catch clause to some place earlier. It should be an automatic operation instead of being triggered by an exception.
# In TestQueueState, can we do rm.stop and then start to simulate an RM restart?

Hope it makes sense. Thanks.
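
To make option #3 concrete, here is a minimal, self-contained sketch of the idea. It does not use the real CapacityScheduler classes: {{QueueState}}, {{SketchQueue}}, {{recoverApp}} and {{appFinished}} are hypothetical names invented for illustration, and the drain-to-stop transition is an assumption of the sketch rather than a statement about the actual patch.

```java
// Hypothetical stand-in for the queue states discussed above.
enum QueueState { RUNNING, DRAINING, STOPPED }

// Hypothetical, simplified queue; not the real CapacityScheduler queue class.
final class SketchQueue {
    private final QueueState configuredState; // state read from the conf
    private QueueState state;                 // in-memory state
    private int numApps;                      // pending + active apps

    SketchQueue(QueueState configuredState) {
        this.configuredState = configuredState;
        // After a restart the in-memory state starts from the conf.
        this.state = configuredState;
    }

    // Called once for every app recovered into this queue after a restart.
    // This runs as a normal step of recovery, not inside an exception handler.
    void recoverApp() {
        numApps++;
        if (configuredState == QueueState.STOPPED
                && state == QueueState.STOPPED) {
            // The queue still has apps, so it is draining, not stopped.
            state = QueueState.DRAINING;
        }
    }

    // Assumption of this sketch: a draining queue stops once it is empty.
    void appFinished() {
        if (numApps == 0) {
            throw new IllegalStateException("no apps in queue");
        }
        numApps--;
        if (state == QueueState.DRAINING && numApps == 0) {
            state = QueueState.STOPPED;
        }
    }

    QueueState getState() { return state; }

    int getNumApps() { return numApps; }
}
```

With this shape, a queue configured as {{STOPPED}} that recovers at least one app reports {{DRAINING}} until its last app finishes, while a queue configured as {{RUNNING}} is unaffected by recovery.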
> DRAINING state of queues can't be recovered after RM restart
> ------------------------------------------------------------
>
>                 Key: YARN-7003
>                 URL: https://issues.apache.org/jira/browse/YARN-7003
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.0, 3.0.0-alpha4
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-7003.001.patch, YARN-7003.002.patch
>
>
> DRAINING is a temporary state held in RM memory: when a queue's state is set
> to STOPPED but the queue still has pending or active apps, the queue state
> is changed to DRAINING instead of STOPPED after refreshing queues.
> We've encountered the problem that the state of this queue will always be
> STOPPED after the RM restarts, so the queue can be removed at any time,
> leaving some apps in a non-existent queue.
> To fix this problem, we could recover the DRAINING state during the recovery
> of pending/active apps. I will upload a patch with a test case later for review.