[ https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645180#comment-16645180 ]
Jason Lowe commented on YARN-8865:
----------------------------------
Thanks for the report and patch! Do we have any idea how these are getting
leaked in the first place? If I recall correctly, there's a thread pool that
periodically tries to renew tokens, and when a token fails to renew because
it has expired, it is removed from the state store. So even after recovery
the RM should try to renew these ancient tokens, fail because they're
expired, and then remove them from the state store. Is the state store
removal itself failing? Each secret manager is responsible for removing the
expired tokens it manages, so I'm wondering why that is not happening here.

Rather than have each state store implement this feature separately, I'm
wondering if the RMDelegationTokenSecretManager should skip loading any
tokens in the recovered RMDTSecretManagerState that are expired and instead
immediately remove them from the state store. Otherwise every state store
has to carry its own copy of this logic, which is a maintenance burden.
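
To make the suggestion concrete, here is a minimal sketch of filtering at recovery time. It is not the actual RMDelegationTokenSecretManager code: the class name, the TokenStore interface, and the assumption that the recovered state is a map from token id to expiry time in epoch milliseconds are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpiredTokenRecoverySketch {

    /** Hypothetical stand-in for whatever removal call the state store exposes. */
    public interface TokenStore {
        void removeToken(String tokenId);
    }

    /**
     * Returns only the tokens that are still valid at {@code now}. Every token
     * whose expiry time is already in the past is removed from the store
     * instead of being loaded.
     *
     * @param recovered token id -> expiry time in epoch millis (assumed shape)
     * @param now       current time in epoch millis
     * @param store     store from which expired tokens are deleted
     */
    public static Map<String, Long> recoverLiveTokens(
            Map<String, Long> recovered, long now, TokenStore store) {
        Map<String, Long> live = new HashMap<>();
        for (Map.Entry<String, Long> entry : recovered.entrySet()) {
            if (entry.getValue() < now) {
                // Expired: cannot be renewed or used, so drop it from the store.
                store.removeToken(entry.getKey());
            } else {
                live.put(entry.getKey(), entry.getValue());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Map<String, Long> recovered = new HashMap<>();
        recovered.put("ancient-token", 1_000L);
        recovered.put("fresh-token", 5_000L);

        // Record removals in a list in place of a real state store.
        List<String> removed = new ArrayList<>();
        Map<String, Long> live = recoverLiveTokens(recovered, 2_000L, removed::add);

        System.out.println("loaded:  " + live.keySet());
        System.out.println("removed: " + removed);
    }
}
```

Because the check happens once in the secret manager, each state store would only need its existing remove operation and no expiry logic of its own.
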
> RMStateStore contains large number of expired RMDelegationToken
> ---------------------------------------------------------------
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.0
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Attachments: YARN-8865.001.patch
>
>
> When the RM state store is restored, expired delegation tokens are restored
> and added to the system. These expired tokens do not get cleaned up or
> removed. The exact reason why the tokens are still in the store is not clear.
> We have seen as many as 250,000 tokens in the store, some of which were 2
> years old.
> This has two side effects:
> * For the ZooKeeper store, this leads to a jute buffer exhaustion issue and
> prevents the RM from becoming active.
> * Restore takes longer than needed and heap usage is higher than it should be.
> We should not restore already expired tokens since they cannot be renewed or
> used.