[ 
https://issues.apache.org/jira/browse/YARN-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965864#comment-14965864
 ] 

Jason Lowe commented on YARN-4041:
----------------------------------

Thanks for updating the patch, Sunil!

When fixing the test, why wasn't the fix made in waitForTokensToBeRenewed?  
Also, I'm not thrilled with the idea of sleeping for 1 second per application 
and hoping it's enough time.  And we exit the wait as soon as there is at least 
one token in the token set, but there's a race where we may have taken the 
snapshot before all the tokens are there.  Can't we key off the app start 
events coming out of the token renewal process to know when we're done?  It 
would be nice to have a more reliable way, so we can avoid arbitrary sleeps 
(which tend to slow down unit tests overall) and racy tests.
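
To illustrate the idea, here is a rough, standalone sketch of waiting on 
completion events instead of sleeping.  It does not use the actual 
DelegationTokenRenewer or test code: FakeRenewer, renewAsync and 
renewAllAndWait are hypothetical names made up for this example, and the 
callback merely stands in for the app start event.  The test blocks on a 
CountDownLatch that is counted down once per application, so it returns as 
soon as every app has been processed and only fails after a bounded timeout.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RenewalLatchSketch {

  /** Hypothetical stand-in for an asynchronous token renewer (not a Hadoop class). */
  static class FakeRenewer {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Set<String> renewedApps = ConcurrentHashMap.newKeySet();

    /** Records the app on a background thread, then fires the per-app callback. */
    void renewAsync(String appId, Runnable onAppStart) {
      pool.execute(() -> {
        renewedApps.add(appId);  // pretend the token renewal happened here
        onAppStart.run();        // stands in for the "app start" event
      });
    }

    Set<String> renewedApps() {
      return renewedApps;
    }

    void stop() {
      pool.shutdownNow();
    }
  }

  /**
   * Submits every app and blocks until each one has signalled completion.
   * Returns false if the timeout elapses first, so a test can fail fast
   * instead of guessing how long to sleep.
   */
  static boolean renewAllAndWait(FakeRenewer renewer, String[] appIds, long timeoutMs)
      throws InterruptedException {
    CountDownLatch done = new CountDownLatch(appIds.length);
    for (String appId : appIds) {
      renewer.renewAsync(appId, done::countDown);
    }
    return done.await(timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    FakeRenewer renewer = new FakeRenewer();
    try {
      String[] apps = {"app_1", "app_2", "app_3"};
      boolean finished = renewAllAndWait(renewer, apps, 5000);
      System.out.println("finished=" + finished
          + " renewed=" + renewer.renewedApps().size());
    } finally {
      renewer.stop();
    }
  }
}
```

The timeout is only an upper bound for a hung test; in the normal case the 
wait ends when the last event arrives, and the token set is only inspected 
after every app has reported in, which avoids the early-snapshot race.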

Also noticed on a subsequent look that AbsrtactDelegationTokenRenewerAppEvent 
should be AbstractDelegationTokenRenewerAppEvent.

> Slow delegation token renewal can severely prolong RM recovery
> --------------------------------------------------------------
>
>                 Key: YARN-4041
>                 URL: https://issues.apache.org/jira/browse/YARN-4041
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Sunil G
>         Attachments: 0001-YARN-4041.patch, 0002-YARN-4041.patch, 
> 0003-YARN-4041.patch
>
>
> When the RM does a work-preserving restart, it synchronously tries to renew 
> delegation tokens for every active application.  If a token server happens to 
> be down or is running slowly, and a lot of the active apps were using tokens 
> from that server, then it can have a huge impact on the time it takes the RM 
> to process the restart.



