Junping Du commented on YARN-3449:

Thanks [~jlowe] for the reply and comments!
I am not entirely sure about this. However, from what I have learned from the 
code, it looks like the RM renews the delegation tokens for finishing apps 
because the NM still needs them to do log aggregation. The NM keeps a token 
alive for log aggregation by sending appTokenKeepAliveMap to the RM in the 
heartbeat and keeping the time value updated (currentTime + 0.7~0.9 * 
tokenRemovalDelayMs) on every heartbeat request/response. If 
appTokenKeepAliveMap is not recovered after the NM restarts, the NM will never 
add these apps to the keep-alive list again (appsToCleanup is sent only once 
by the RM), and the RM won't renew the token once the time expires (based on 
the last heartbeat request before the NM restart) because it won't receive any 
new messages from the NM about these apps. 
In practice, this issue is not obvious because tokenRemovalDelayMs is often 
very large (10 minutes by default), and there are very few cases where the NM 
cannot finish log aggregation within that time (even counting the NM restart 
time). However, we should still fix it because it makes the delegation token 
renewal behavior inconsistent before and after an NM restart (and it is a bug, 
at least theoretically). Isn't that right?
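
To make the failure mode concrete, here is a minimal sketch of the keep-alive
bookkeeping as I understand it. It is a simplified model, not the actual
NodeStatusUpdaterImpl code: the class KeepAliveSketch and its methods
(onAppsToCleanup, keepAliveAppsForHeartbeat, snapshot, recover) are
hypothetical names, app IDs are plain strings, and the current time is passed
in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Simplified, hypothetical model of the NM-side keep-alive map.
class KeepAliveSketch {
    // appId -> time (ms) at which the next keep-alive should be sent
    private final Map<String, Long> appTokenKeepAliveMap = new HashMap<>();
    private final long tokenRemovalDelayMs;
    private final Random random = new Random();

    KeepAliveSketch(long tokenRemovalDelayMs) {
        this.tokenRemovalDelayMs = tokenRemovalDelayMs;
    }

    // Next keep-alive is due at now + (0.7..0.9) * tokenRemovalDelayMs.
    private long nextKeepAliveTime(long now) {
        double factor = 0.7 + 0.2 * random.nextDouble();
        return now + (long) (factor * tokenRemovalDelayMs);
    }

    // The RM reports finished apps only once; this is the only entry point.
    void onAppsToCleanup(List<String> appIds, long now) {
        for (String appId : appIds) {
            appTokenKeepAliveMap.put(appId, nextKeepAliveTime(now));
        }
    }

    // Apps to list in the next heartbeat so the RM keeps their tokens alive.
    List<String> keepAliveAppsForHeartbeat(Set<String> appsStillOnNode, long now) {
        List<String> result = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it =
            appTokenKeepAliveMap.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (!appsStillOnNode.contains(entry.getKey())) {
                it.remove(); // app is fully cleaned up on this node
            } else if (now > entry.getValue()) {
                result.add(entry.getKey());
                entry.setValue(nextKeepAliveTime(now)); // schedule the next one
            }
        }
        return result;
    }

    // Stand-ins for persisting to and recovering from a state store.
    Map<String, Long> snapshot() {
        return new HashMap<>(appTokenKeepAliveMap);
    }

    void recover(Map<String, Long> saved) {
        appTokenKeepAliveMap.putAll(saved);
    }

    public static void main(String[] args) {
        Set<String> onNode = Collections.singleton("app_1");
        KeepAliveSketch before = new KeepAliveSketch(1000L);
        before.onAppsToCleanup(Arrays.asList("app_1"), 0L);
        Map<String, Long> saved = before.snapshot();

        // Restart without recovery: the map is empty and stays empty.
        KeepAliveSketch lost = new KeepAliveSketch(1000L);
        System.out.println(lost.keepAliveAppsForHeartbeat(onNode, 5000L));

        // Restart with recovery: the app is listed for keep-alive again.
        KeepAliveSketch recovered = new KeepAliveSketch(1000L);
        recovered.recover(saved);
        System.out.println(recovered.keepAliveAppsForHeartbeat(onNode, 5000L));
    }
}
```

In this model, the instance that restarts without recover() never lists app_1
in a heartbeat again, because nothing re-populates the map once the one-time
cleanup notification has been consumed. The instance that recovers the saved
map resumes sending keep-alives as before.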

> Recover appTokenKeepAliveMap upon nodemanager restart
> -----------------------------------------------------
>                 Key: YARN-3449
>                 URL: https://issues.apache.org/jira/browse/YARN-3449
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Junping Du
>            Assignee: Junping Du
> appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application 
> alive after it has finished while the NM still needs the app token to do log 
> aggregation (when security and log aggregation are enabled). 
> Applications are only inserted into this map when the NM receives 
> getApplicationsToCleanup() in the RM heartbeat response, and the RM only 
> sends this info once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup(). 
> Work-preserving NM restart should put appTokenKeepAliveMap into NMStateStore 
> and recover it after restart. Without this, the RM could terminate the 
> application earlier, so log aggregation could fail if security is 
> enabled.
