Krishan Goyal created YARN-11473:
------------------------------------

             Summary: Create a safe mode RM service to enable DB access
                 Key: YARN-11473
                 URL: https://issues.apache.org/jira/browse/YARN-11473
             Project: Hadoop YARN
          Issue Type: Task
          Components: resourcemanager
            Reporter: Krishan Goyal
            Assignee: Krishan Goyal

We have seen various issues where the RM fails to start because bad state leads to exceptions on startup, e.g. https://issues.apache.org/jira/browse/YARN-2340

Another issue we have seen internally involves the capacity scheduler config:
{noformat}
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
{noformat}

In such cases, we can't recover until a bug fix is deployed that lets the RM start so that the data can be corrected. And while the RM is forcefully brought up in those cases, it can still serve client / AM requests and further complicate things.

Ideally we should be able to fix the database independently of whether the RM can start. But with LevelDB, which is an embedded database, this isn't possible without the RM being up. Separate tools like [leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful, because they require additional code to handle specific comparators etc. and have to be deployed together with the RM binaries.

A patch to delete applications from the state store was implemented in https://issues.apache.org/jira/browse/YARN-3410, but that won't work for other bad entries in the state store, such as DTs / master keys / app attempts / CS conf, from which we can't recover. Generic DB access would be helpful to delete / update invalid keys.

A better solution is to create a safe mode feature in the RM, which starts the RM with basic functionality to enable fixing it. The RM will not serve client / AM / NM requests in this mode. This mode will enable selective admin functionality only (read / write access to the state store).
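
To make the proposed behaviour concrete, here is a minimal sketch of the gating described above. It is not taken from any patch: the class {{SafeModeGate}}, the {{Caller}} enum, and the in-memory map standing in for the LevelDB-backed state store are all hypothetical names used for illustration. It also assumes that raw state store access is exposed only while safe mode is on, which the description does not spell out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical sketch of a safe-mode gate; none of these names exist in YARN. */
final class SafeModeGate {
  /** Kinds of callers the RM normally serves, plus the admin. */
  enum Caller { CLIENT, APPLICATION_MASTER, NODE_MANAGER, ADMIN }

  private final boolean safeMode;
  // In-memory stand-in for the embedded state store (an assumption for this sketch).
  private final Map<String, String> stateStore;

  SafeModeGate(boolean safeMode, Map<String, String> initialEntries) {
    this.safeMode = safeMode;
    this.stateStore = new TreeMap<>(initialEntries);
  }

  /** In safe mode only ADMIN is served; otherwise every caller is served. */
  boolean accepts(Caller caller) {
    return !safeMode || caller == Caller.ADMIN;
  }

  /** Reads a raw state store entry; allowed only for ADMIN in safe mode. */
  Optional<String> read(Caller caller, String key) {
    checkStoreAccess(caller);
    return Optional.ofNullable(stateStore.get(key));
  }

  /** Overwrites (or creates) a raw state store entry. */
  void write(Caller caller, String key, String value) {
    checkStoreAccess(caller);
    stateStore.put(key, value);
  }

  /** Deletes an entry; returns true if the key was present. */
  boolean delete(Caller caller, String key) {
    checkStoreAccess(caller);
    return stateStore.remove(key) != null;
  }

  private void checkStoreAccess(Caller caller) {
    if (!safeMode) {
      throw new IllegalStateException("raw state store access requires safe mode");
    }
    if (caller != Caller.ADMIN) {
      throw new SecurityException("only admin may access the state store");
    }
  }
}
```

With such a gate, an admin could call {{delete(Caller.ADMIN, key)}} on an invalid entry while {{accepts}} keeps returning false for client / AM / NM callers, so no new work reaches the RM until the store is repaired and the RM is restarted normally.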