[ 
https://issues.apache.org/jira/browse/YARN-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714476#comment-17714476
 ] 

ASF GitHub Bot commented on YARN-11473:
---------------------------------------

krishan1390 opened a new pull request, #5576:
URL: https://github.com/apache/hadoop/pull/5576

   ### Description of PR
   The requirement is captured in the JIRA issue (YARN-11473).
   
   ### Code Changes
   
   1. Create a new command and RPC method for the Admin Service (and the required 
protobufs)
   2. Provide an argument to start RM in safe mode, which starts only the Admin 
Service
   3. Create a generic KV interface and a generic LevelDB interface to access the DB
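
As a rough illustration of item 3, a generic KV contract could look like the sketch below. The names `KeyValueStore` and `InMemoryKeyValueStore` are hypothetical and are not taken from the patch; the in-memory map only stands in for a LevelDB-backed implementation so that the example is self-contained.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical generic key-value contract (illustrative; not the interface from the patch). */
interface KeyValueStore extends AutoCloseable {
  Optional<byte[]> get(byte[] key);

  void set(byte[] key, byte[] value);

  /** Returns true if the key existed and was removed. */
  boolean del(byte[] key);

  @Override
  void close();
}

/** In-memory stand-in for a LevelDB-backed implementation (an assumption of this sketch). */
final class InMemoryKeyValueStore implements KeyValueStore {
  // Keys are ordered by unsigned bytewise comparison.
  private final TreeMap<byte[], byte[]> data =
      new TreeMap<byte[], byte[]>((a, b) -> Arrays.compareUnsigned(a, b));
  private boolean closed;

  @Override
  public Optional<byte[]> get(byte[] key) {
    ensureOpen();
    requireNonNull(key, "key");
    byte[] value = data.get(key);
    // Return a copy so callers cannot mutate stored state.
    return value == null ? Optional.empty() : Optional.of(value.clone());
  }

  @Override
  public void set(byte[] key, byte[] value) {
    ensureOpen();
    requireNonNull(key, "key");
    requireNonNull(value, "value");
    data.put(key.clone(), value.clone());
  }

  @Override
  public boolean del(byte[] key) {
    ensureOpen();
    requireNonNull(key, "key");
    return data.remove(key) != null;
  }

  @Override
  public void close() {
    closed = true;
  }

  private void ensureOpen() {
    if (closed) {
      throw new IllegalStateException("store is closed");
    }
  }

  private static void requireNonNull(byte[] arg, String name) {
    if (arg == null) {
      throw new IllegalArgumentException(name + " must not be null");
    }
  }
}
```

Keeping keys and values as raw bytes leaves serialization to the caller, which is one way a single interface could cover different kinds of state store entries.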
   
   ### How was this patch tested?
   1. Tested that the DB can't be accessed while RM is running.
   2. Tested basic happy cases of get / set / del with valid and invalid arguments.
   3. Tested that the DB can be accessed when RM can't start due to DB corruption, 
and that the DB can be updated in this case to fix RM state.
   4. Tested that the DB can't be accessed when it is not yet initialized.
   5. Tested that the DB can't be accessed if it is being used by another RM 
process. Also, RM can't start if an RM is already running in safe mode, and vice 
versa (this is because of LevelDB locks).
   6. Tested Conf store keys and RM DT Master keys.
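
The mutual exclusion in item 5 comes from LevelDB holding a lock on a `LOCK` file in the DB directory. The sketch below only mimics that single-owner behaviour with `java.nio` file locks. It does not use LevelDB, and the class name `DbDirectoryLock` is hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper that mimics single-owner locking of a DB directory. */
final class DbDirectoryLock implements AutoCloseable {
  private final FileChannel channel;
  private final FileLock lock;

  private DbDirectoryLock(FileChannel channel, FileLock lock) {
    this.channel = channel;
    this.lock = lock;
  }

  /** Acquires the lock or fails with IOException if the directory is already in use. */
  static DbDirectoryLock acquire(Path dbDir) throws IOException {
    Files.createDirectories(dbDir);
    FileChannel channel = FileChannel.open(
        dbDir.resolve("LOCK"), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    FileLock lock = null;
    try {
      // tryLock returns null when another process holds the lock.
      lock = channel.tryLock();
    } catch (OverlappingFileLockException e) {
      // The lock is already held inside this JVM; handled like "held elsewhere".
    } finally {
      if (lock == null) {
        channel.close();
      }
    }
    if (lock == null) {
      throw new IOException("DB directory is already in use: " + dbDir);
    }
    return new DbDirectoryLock(channel, lock);
  }

  @Override
  public void close() throws IOException {
    lock.release();
    channel.close();
  }
}
```

With such a lock, whichever process opens the directory first (a regular RM or a safe mode RM) blocks the other until it releases the lock.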
   
   Unit test cases aren't really applicable here, as all code changes are about 
tying together various pieces from AdminCLI to LevelDB rather than adding any 
functional logic.
   
   
   




> Create a safe mode RM service to enable DB access
> -------------------------------------------------
>
>                 Key: YARN-11473
>                 URL: https://issues.apache.org/jira/browse/YARN-11473
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: resourcemanager
>            Reporter: Krishan Goyal
>            Assignee: Krishan Goyal
>            Priority: Major
>
> We have seen various issues where RM fails to start because bad state leads 
> to exceptions on startup.
> E.g.: https://issues.apache.org/jira/browse/YARN-2340
> Another issue we have seen internally involves an invalid capacity 
> scheduler config:
> {noformat}
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
> {noformat}
> In such cases, we can't recover until a bug fix is deployed that lets RM 
> start so that the data can be corrected. Moreover, while RM is forcibly 
> brought up in those cases, it can still serve client / AM requests and further 
> complicate things. 
> Ideally we should be able to fix the database even when RM is unable to 
> start. But with LevelDB, which is an embedded database, this isn't possible 
> without RM being up. Separate tools like 
> [leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful 
> because they require additional code to handle specific comparators etc. and 
> have to be deployed together with the RM binaries.  
> A patch to delete applications from the state store was implemented in 
> https://issues.apache.org/jira/browse/YARN-3410, but that won't work for other 
> bad entries in the state store, such as DTs / Master keys / App attempts / CS 
> Conf, from which we can't recover.
> Generic DB access would help to delete / update invalid keys. 
> A better solution is to create a safe mode feature in RM which starts RM with 
> only the basic functionality needed to fix it. RM will not serve client / AM / 
> NM requests in this mode. This mode will enable only selective admin 
> functionality (read / write access to the state store). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
