Krishan Goyal created YARN-11473:
------------------------------------

             Summary: Create a safe mode RM service to enable DB access
                 Key: YARN-11473
                 URL: https://issues.apache.org/jira/browse/YARN-11473
             Project: Hadoop YARN
          Issue Type: Task
          Components: resourcemanager
            Reporter: Krishan Goyal
            Assignee: Krishan Goyal


We have seen several cases where RM fails to start because of bad state that 
causes exceptions on startup.

E.g.: https://issues.apache.org/jira/browse/YARN-2340

Another failure we have seen internally is caused by problems in the capacity 
scheduler config:
{noformat}
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
{noformat}
In such cases, we can't recover until a bug fix is deployed that lets RM start 
so that the data can be corrected. And if RM is forcibly brought up in the 
meantime, it can still serve client / AM requests and further complicate 
things.

Ideally we should be able to fix the database even when RM is unable to start 
up. But LevelDB is an embedded database, so this isn't possible without RM 
being up. Separate tools like 
[leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful 
because they need additional code to handle specific comparators etc. and need 
to be deployed together with the RM binaries.

A patch to delete applications from the state store was implemented in 
https://issues.apache.org/jira/browse/YARN-3410 but that won't work for other 
bad state store entries from which we can't recover, such as DTs / master keys 
/ app attempts / CS conf.

Generic DB access would help to delete / update invalid keys.
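
To illustrate what generic key access could look like, here is a minimal 
sketch. It is not a YARN or LevelDB API: the class StateStoreKeyTool, its 
methods and the key names in the usage note are all hypothetical, and a sorted 
in-memory map stands in for the embedded store's ordered key space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of generic key access; not a YARN or LevelDB API. */
final class StateStoreKeyTool {
  // Sorted in-memory map standing in for the embedded store (assumption).
  private final NavigableMap<String, String> store = new TreeMap<>();

  /** Inserts a key or overwrites its current value. */
  void put(String key, String value) {
    store.put(key, value);
  }

  /** Returns the stored value, or null if the key is absent. */
  String get(String key) {
    return store.get(key);
  }

  /** Lists every key that starts with the given prefix, in sorted order. */
  List<String> keysWithPrefix(String prefix) {
    List<String> keys = new ArrayList<>();
    // Keys sharing a prefix are contiguous in sorted order, so start at the
    // first key >= prefix and stop at the first key that does not match.
    for (String key : store.tailMap(prefix, true).keySet()) {
      if (!key.startsWith(prefix)) {
        break;
      }
      keys.add(key);
    }
    return keys;
  }

  /** Deletes every key under the prefix and returns how many were removed. */
  int deletePrefix(String prefix) {
    List<String> keys = keysWithPrefix(prefix);
    for (String key : keys) {
      store.remove(key);
    }
    return keys.size();
  }
}
```

With made-up keys such as apps/app_1 and tokens/token_7, an operator could 
list everything under apps/, overwrite a single bad value with put, or drop the 
whole prefix with deletePrefix while leaving tokens/ untouched.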

A better solution is to create a safe mode feature in RM which starts RM with 
only basic functionality so that the bad state can be fixed. RM will not serve 
client / AM / NM requests in this mode. This mode will enable selective admin 
functionality only (read / write access to the state store).
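
The gating described above could be sketched roughly as follows. This is only 
an illustration under stated assumptions: SafeModeGate, its Caller enum and the 
adminRead / adminWrite methods are hypothetical names, not existing RM classes, 
and the state store is again an in-memory map. Restricting raw state store 
access to safe mode is also an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a safe mode gate; none of these names exist in YARN. */
final class SafeModeGate {
  /** Caller categories named in the proposal, plus the admin role. */
  enum Caller { CLIENT, APPLICATION_MASTER, NODE_MANAGER, ADMIN }

  private final boolean safeMode;
  // In-memory stand-in for the RM state store (assumption).
  private final Map<String, String> stateStore = new TreeMap<>();

  SafeModeGate(boolean safeMode) {
    this.safeMode = safeMode;
  }

  /** Returns true if a request from this caller would be served. */
  boolean accepts(Caller caller) {
    // In safe mode only admin requests get through.
    return !safeMode || caller == Caller.ADMIN;
  }

  /** Reads a raw state store entry; allowed only for admins in safe mode. */
  String adminRead(Caller caller, String key) {
    requireSafeModeAdmin(caller);
    return stateStore.get(key);
  }

  /** Writes a raw entry, or deletes it when value is null. */
  void adminWrite(Caller caller, String key, String value) {
    requireSafeModeAdmin(caller);
    if (value == null) {
      stateStore.remove(key);
    } else {
      stateStore.put(key, value);
    }
  }

  private void requireSafeModeAdmin(Caller caller) {
    if (!safeMode || caller != Caller.ADMIN) {
      throw new IllegalStateException(
          "raw state store access requires safe mode and an admin caller");
    }
  }
}
```

In this sketch a gate created with safeMode=true rejects CLIENT, 
APPLICATION_MASTER and NODE_MANAGER callers and lets an ADMIN caller read, 
write or delete entries. A gate created with safeMode=false accepts all 
callers but refuses raw state store access.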



