Jian He commented on YARN-2964:

Thanks for your comments, Jason!

bq. I'm wondering about the change in the removeApplicationFromRenewal method 
or remove.
If the launcher job gets added to the appTokens map first, DelegationTokenRenewer 
will not add a DelegationTokenToRenew instance for the sub-job. So the token lookup 
in removeApplicationFromRenewal will return empty for the sub-job when the sub-job 
completes, and the token won’t be removed from allTokens. My only concern 
with a global set is that each time an application completes, we end up 
looping over all the applications or worse (each app may have at least one token).
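
A minimal sketch of the scenario described above, not the actual DelegationTokenRenewer code. It assumes a global token map plus a per-app map of owned tokens; the class name, the method names, and the use of plain strings for tokens and app ids are all hypothetical. It shows that a sub-job reusing the launcher's token owns nothing, so its completion removes nothing and only touches its own entries.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not the actual DelegationTokenRenewer code.
public class TokenTrackingSketch {
    // Global view: token -> id of the app that registered it first.
    private final Map<String, String> allTokens = new HashMap<>();
    // Per-app view: app id -> tokens that this app registered itself.
    private final Map<String, Set<String>> appTokens = new HashMap<>();

    // Register an app's tokens; a token that is already tracked gets no second entry.
    public void addApplication(String appId, Set<String> tokens) {
        Set<String> owned = appTokens.computeIfAbsent(appId, k -> new HashSet<>());
        for (String token : tokens) {
            if (allTokens.putIfAbsent(token, appId) == null) {
                owned.add(token);
            }
        }
    }

    // Remove only the tokens this app owns; the cost depends on its own tokens,
    // not on the number of running applications.
    public Set<String> removeApplication(String appId) {
        Set<String> owned = appTokens.remove(appId);
        if (owned == null) {
            return Collections.emptySet();
        }
        allTokens.keySet().removeAll(owned);
        return owned;
    }

    public boolean isTracked(String token) {
        return allTokens.containsKey(token);
    }

    public static void main(String[] args) {
        TokenTrackingSketch tracker = new TokenTrackingSketch();
        tracker.addApplication("launcher", Collections.singleton("token-A"));
        tracker.addApplication("sub-job", Collections.singleton("token-A"));
        // The sub-job owns no tokens, so its completion removes nothing.
        System.out.println(tracker.removeApplication("sub-job").isEmpty()); // prints true
        System.out.println(tracker.isTracked("token-A"));                   // prints true
    }
}
```
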
bq. This comment doesn't match the code
Good catch, what a mistake. I was under the impression that the semantics were 
“shouldKeepAtEnd”. I added one line to the test case to guard against this.
bq. Wonder if we should be using a Set instead of a Map to track these tokens
Thought about that too. The reason I switched to a map is to get the 
DelegationTokenToRenew instance based on the token the app provided and change 
its shouldCancelAtEnd field on submission.
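
To illustrate the Map-versus-Set point: a Set can only report whether a token is present, while a Map keyed by the token hands back the tracked instance so its flag can be changed. The sketch below is a simplified model under assumptions. Only the names DelegationTokenToRenew and shouldCancelAtEnd come from the discussion; the string token key, the submit method, and the rule that the newest submission's flag overwrites the old one are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; not the actual patch.
public class TokenLookupSketch {
    static final class DelegationTokenToRenew {
        final String token;
        boolean shouldCancelAtEnd;

        DelegationTokenToRenew(String token, boolean shouldCancelAtEnd) {
            this.token = token;
            this.shouldCancelAtEnd = shouldCancelAtEnd;
        }
    }

    // Keyed by token so that a submission can find the instance already tracked.
    private final Map<String, DelegationTokenToRenew> allTokens = new HashMap<>();

    public DelegationTokenToRenew submit(String token, boolean shouldCancelAtEnd) {
        DelegationTokenToRenew tracked = allTokens.get(token);
        if (tracked == null) {
            tracked = new DelegationTokenToRenew(token, shouldCancelAtEnd);
            allTokens.put(token, tracked);
        } else {
            // Assumed rule: the newest submission's flag overwrites the old one.
            tracked.shouldCancelAtEnd = shouldCancelAtEnd;
        }
        return tracked;
    }

    public static void main(String[] args) {
        TokenLookupSketch renewer = new TokenLookupSketch();
        DelegationTokenToRenew first = renewer.submit("token-A", true);
        DelegationTokenToRenew second = renewer.submit("token-A", false);
        System.out.println(first == second);         // prints true
        System.out.println(first.shouldCancelAtEnd); // prints false
    }
}
```
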

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-2964.1.patch
> The RM used to globally track the unique set of tokens for all apps.  It 
> remembered the first job that was submitted with the token.  The first job 
> controlled the cancellation of the token.  This prevented completion of 
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
> notion of the first/main job.  This results in sub-jobs canceling tokens and 
> failing the main job and other sub-jobs.  It also appears to schedule 
> multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 
> min (NM liveness interval) after log aggregation completes.  The result is 
> that an oozie job, e.g. pig, that launches many sub-jobs over time will fail if 
> any sub-jobs are launched >10 min after any sub-job completes.  If all other 
> sub-jobs complete within that 10 min window, then the issue goes unnoticed.

This message was sent by Atlassian JIRA
