[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154647#comment-17154647 ]

Jim Brennan commented on YARN-10348:
------------------------------------

The TestFairSchedulerPreemption unit test failure is unrelated to this Jira.
I believe this is ready for review.


> Allow RM to always cancel tokens after app completes
> ----------------------------------------------------
>
>                 Key: YARN-10348
>                 URL: https://issues.apache.org/jira/browse/YARN-10348
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.10.0, 3.1.3
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10348.001.patch, YARN-10348.002.patch
>
>
> (Note: this change was originally made on our internal branch by [~daryn].)
> The RM currently has an option that lets a client disable token
> cancellation when a job completes. This feature was an initial attempt to
> address the use case of a job launching sub-jobs (i.e., the Oozie launcher)
> where the original job finishes before its sub-jobs complete: completion of
> the original job triggered premature cancellation of tokens the sub-jobs
> still needed.
> Many years ago, [~daryn] added a more robust implementation that ref-counts
> tokens ([YARN-3055]). It prevents premature cancellation by keeping a token
> alive until all apps using it complete, which eliminated the need for a
> client to specify cancel=false. Unfortunately, the config option was not
> removed.
> We have seen cases where Oozie "java actions" and some users explicitly
> disable token cancellation. This can lead to a buildup of defunct tokens
> that may overwhelm the ZK buffer used by the KDC's backing store. At that
> point the KMS fails to connect to ZK and is unable to issue or validate new
> tokens, leaving the KDC able to authenticate only pre-existing tokens.
> Production incidents have occurred due to this buffer-size issue.
> To avoid these issues, the RM should have an option to ignore (override)
> the client's request not to cancel tokens.
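The interaction described above (ref-counted tokens, a client-side "do not cancel" flag, and an RM-side override) can be sketched as a small model. This is a hypothetical illustration, not the RM's actual implementation or the attached patch: `TokenTracker`, `alwaysCancel`, `appStarted`/`appFinished`, and the choice to OR together the per-app cancel requests for a shared token are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of ref-counted token cancellation with an RM-side
 * "always cancel" override. Names are illustrative, not YARN's API.
 */
public class TokenCancelSketch {

    static final class TokenTracker {
        // Hypothetical RM-side switch: ignore clients that ask not to cancel.
        private final boolean alwaysCancel;
        // token -> ids of running apps still referencing it (the ref count)
        private final Map<String, Set<String>> appsByToken = new HashMap<>();
        // token -> whether any app using it asked for cancellation (assumption)
        private final Map<String, Boolean> cancelAtEnd = new HashMap<>();
        private final Set<String> cancelled = new HashSet<>();

        TokenTracker(boolean alwaysCancel) {
            this.alwaysCancel = alwaysCancel;
        }

        void appStarted(String appId, String token, boolean cancelWhenComplete) {
            appsByToken.computeIfAbsent(token, t -> new HashSet<>()).add(appId);
            cancelAtEnd.merge(token, cancelWhenComplete, Boolean::logicalOr);
        }

        /** Returns true if this completion caused the token to be cancelled. */
        boolean appFinished(String appId, String token) {
            Set<String> apps = appsByToken.get(token);
            if (apps == null || !apps.remove(appId)) {
                return false; // unknown app/token pair
            }
            if (!apps.isEmpty()) {
                return false; // other apps still use the token: keep it alive
            }
            appsByToken.remove(token);
            boolean requested = cancelAtEnd.remove(token);
            if (alwaysCancel || requested) {
                cancelled.add(token);
                return true;
            }
            return false; // client opted out: token lingers until it expires
        }

        boolean isCancelled(String token) {
            return cancelled.contains(token);
        }
    }

    public static void main(String[] args) {
        // A launcher and its sub-job share a token; both opt out of cancellation.
        TokenTracker legacy = new TokenTracker(false);
        legacy.appStarted("launcher", "tok", false);
        legacy.appStarted("subjob", "tok", false);
        System.out.println(legacy.appFinished("launcher", "tok")); // false
        System.out.println(legacy.appFinished("subjob", "tok"));   // false

        // Same apps, but the RM overrides the opt-out.
        TokenTracker override = new TokenTracker(true);
        override.appStarted("launcher", "tok", false);
        override.appStarted("subjob", "tok", false);
        System.out.println(override.appFinished("launcher", "tok")); // false
        System.out.println(override.appFinished("subjob", "tok"));   // true
    }
}
```

In this sketch the ref count alone already protects the sub-job: the launcher finishing never cancels the shared token. The override only changes what happens after the last app finishes, which is why ignoring the client's cancel=false request should be safe once ref counting is in place.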



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
