[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646515#comment-16646515
 ] 

Daryn Sharp commented on YARN-8865:
-----------------------------------

Good job.  That explains why the secret manager doesn't remove them.  What's 
interesting is that secret keys are supposed to outlive their tokens.  Were the 
secret keys manually deleted?  Regardless, the secret manager should be able to 
recover its state.

The patch is a high-risk change for a common class.  Not all secret managers 
are equipped to handle mutation during loading.  Case in point: the NN 
generates an edit to remove tokens, but edits cannot be generated while 
replaying edits (restoring state).  Fundamentally, an HA standby cannot modify 
state.  Similar issues probably exist for other secret managers.

Perhaps the lowest-risk change is to add tokens with an invalid key anyway and 
set the password to null.  Authentication will fail, and the expiration thread 
should then be able to remove the tokens correctly.
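
To make the suggestion concrete, here is a minimal, self-contained sketch. It 
is not the Hadoop AbstractDelegationTokenSecretManager API: the class 
TokenRecoverySketch and all of its members are hypothetical, and HmacSHA1 is 
only a stand-in for however the password is really derived.  It assumes the 
expiration pass runs only once the service is active and allowed to modify 
state.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical, simplified model of a secret manager (not the Hadoop API).
final class TokenRecoverySketch {
    static final class TokenInfo {
        final long renewDate;   // expiry time in milliseconds
        final byte[] password;  // null when the signing key is unknown
        TokenInfo(long renewDate, byte[] password) {
            this.renewDate = renewDate;
            this.password = password;
        }
    }

    private final Map<Integer, byte[]> keys = new ConcurrentHashMap<>();
    private final Map<String, TokenInfo> tokens = new ConcurrentHashMap<>();

    void addKey(int keyId, byte[] secret) {
        keys.put(keyId, secret.clone());
    }

    // Stand-in password derivation: HMAC of the token id under the key.
    static byte[] derivePassword(String tokenId, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            return mac.doFinal(tokenId.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Restore path: a token whose key is missing is still added, with a
    // null password.  Nothing is removed or persisted while loading.
    void recoverToken(String tokenId, int keyId, long renewDate) {
        byte[] key = keys.get(keyId);
        byte[] password = (key == null) ? null : derivePassword(tokenId, key);
        tokens.put(tokenId, new TokenInfo(renewDate, password));
    }

    // A null password can never match, so such tokens fail authentication.
    boolean authenticate(String tokenId, byte[] presented) {
        TokenInfo info = tokens.get(tokenId);
        return info != null && info.password != null
                && MessageDigest.isEqual(info.password, presented);
    }

    // Expiration pass: the ordinary removal path also cleans up the
    // tokens that were restored with a null password.
    int removeExpired(long now) {
        int removed = 0;
        Iterator<Map.Entry<String, TokenInfo>> it = tokens.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().renewDate < now) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return tokens.size();
    }
}
```

The point of the design is that the restore path stays read-only: the orphaned 
token is tracked like any other, so the normal expiration path is the only 
place where removal happens.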

Alternatively, the lowest-risk change may be to modify the RMDTSM itself to 
handle removal while restoring state.
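
One possible reading of "handle removal while restoring state" is to defer the 
removals: note the expired tokens during recovery and delete them from the 
store only after recovery has finished.  The sketch below illustrates that 
reading only.  DeferredRemovalSketch, StateStore and finishRecovery are 
hypothetical names, not RMDelegationTokenSecretManager or RMStateStore APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: removals found during restore are queued and
// applied only once recovery is complete.
final class DeferredRemovalSketch {
    // Hypothetical stand-in for the persistent store.
    interface StateStore {
        void removeToken(String tokenId);
    }

    private final StateStore store;
    private final Map<String, Long> tokens = new HashMap<>();
    private final List<String> pendingRemovals = new ArrayList<>();

    DeferredRemovalSketch(StateStore store) {
        this.store = store;
    }

    // Restore path: expired tokens are not loaded, and the store is not
    // touched here; the removal is only recorded for later.
    void recoverToken(String tokenId, long renewDate, long now) {
        if (renewDate < now) {
            pendingRemovals.add(tokenId);
        } else {
            tokens.put(tokenId, renewDate);
        }
    }

    // Called once recovery is done and the store may be modified.
    // Returns the number of removals that were applied.
    int finishRecovery() {
        int applied = pendingRemovals.size();
        for (String tokenId : pendingRemovals) {
            store.removeToken(tokenId);
        }
        pendingRemovals.clear();
        return applied;
    }

    int liveTokens() {
        return tokens.size();
    }
}
```

Because the change is confined to one secret manager, other users of the 
common class would be unaffected.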

> RMStateStore contains large number of expired RMDelegationToken
> ---------------------------------------------------------------
>
>                 Key: YARN-8865
>                 URL: https://issues.apache.org/jira/browse/YARN-8865
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>         Attachments: YARN-8865.001.patch, YARN-8865.002.patch
>
>
> When the RM state store is restored, expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store, some of which were 2 
> years old.
> This has two side effects:
> * for the ZooKeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
