[jira] [Commented] (YARN-7003) DRAINING state of queues can't be recovered after RM restart

Weiwei Yang (JIRA) Thu, 10 May 2018 02:32:53 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470144#comment-16470144
 ]


Weiwei Yang commented on YARN-7003:
-----------------------------------

Hi [~Tao Yang]

The current approach to recovering queue state is not ideal and causes state 
mismatches like this one. When the queue state in the conf is updated to 
{{STOPPED}} while the queue still has apps, the RM internally changes it to 
{{DRAINING}} without persisting that back to the conf, so the in-memory state 
becomes inconsistent with the persisted state. There are a few options to fix 
this:
 # Let the RM update the queue state in the conf before it updates the 
in-memory state. Having CS update its own conf file is not good practice.
 # Remove the queue state from the conf and let the RM persist it in its own 
state store. This needs more changes but should be the most consistent 
approach. However, it is an incompatible change that would cause more problems 
for users.
 # Whenever apps are added to a queue during recovery and the configured state 
is {{STOPPED}}, reset the state to {{DRAINING}}. This is simply the reverse of 
what CS does today for the {{STOPPED}} state.
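
To make option #3 concrete, here is a minimal sketch of the rule using a toy queue model. Everything in it ({{QueueRecoverySketch}}, {{Queue}}, {{recoverApp}}, {{finishApp}}) is a hypothetical name made up for illustration; it is not the real CapacityScheduler code and not the uploaded patch. It assumes the configured state is the only thing that survives a restart and that recovered apps are re-added to their queue one by one.

```java
// Toy model only: these are NOT the real CapacityScheduler classes.
final class QueueRecoverySketch {

    enum QueueState { RUNNING, DRAINING, STOPPED }

    static final class Queue {
        // State read from the conf; the only state that survives a restart here.
        private final QueueState configuredState;
        // In-memory state; starts out equal to the configured state.
        private QueueState state;
        private int apps;

        Queue(QueueState configuredState) {
            this.configuredState = configuredState;
            this.state = configuredState;
        }

        // Hypothetical hook invoked for each app re-added during RM recovery.
        // A STOPPED queue that still receives recovered apps goes back to DRAINING.
        void recoverApp() {
            apps++;
            if (configuredState == QueueState.STOPPED
                    && state == QueueState.STOPPED) {
                state = QueueState.DRAINING;
            }
        }

        // Assumed behaviour for this sketch: once the last app finishes,
        // a DRAINING queue settles into STOPPED.
        void finishApp() {
            if (apps == 0) {
                throw new IllegalStateException("no apps in queue");
            }
            apps--;
            if (apps == 0 && state == QueueState.DRAINING) {
                state = QueueState.STOPPED;
            }
        }

        QueueState getState() {
            return state;
        }
    }

    public static void main(String[] args) {
        // Simulated restart: the queue is rebuilt from the conf as STOPPED.
        Queue q = new Queue(QueueState.STOPPED);
        System.out.println("after restart:      " + q.getState());
        q.recoverApp();
        System.out.println("after app recovery: " + q.getState());
        q.finishApp();
        System.out.println("after app finished: " + q.getState());
    }
}
```

Note that the transition lives in the normal recovery path of the toy model, so it does not depend on any exception being thrown.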

Given that #1 is not good practice and #2 is too risky, I prefer #3, which is 
similar to the patch [~Tao Yang] uploaded. A few comments:
 # In CapacityScheduler, move the queue state recovery logic out of the catch 
clause to somewhere earlier. It should happen automatically rather than being 
triggered by an exception.
 # In TestQueueState, can we do rm.stop and then start to simulate an RM 
restart?

Hope it makes sense. Thanks.

> DRAINING state of queues can't be recovered after RM restart
> ------------------------------------------------------------
>
>                 Key: YARN-7003
>                 URL: https://issues.apache.org/jira/browse/YARN-7003
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.0, 3.0.0-alpha4
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-7003.001.patch, YARN-7003.002.patch
>
>
> DRAINING is a temporary state held in RM memory. When a queue's state is set 
> to STOPPED but it still has pending or active apps, the queue state is 
> changed to DRAINING instead of STOPPED after refreshing queues. 
> We've encountered a problem where the state of such a queue is always 
> STOPPED after an RM restart, so it can be removed at any time, leaving 
> some apps in a non-existent queue.
> To fix this problem, we could recover the DRAINING state during the recovery 
> of pending/active apps. I will upload a patch with a test case later for review.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
