Jason Lowe commented on YARN-2964:

Thanks for the patch, Jian!  The Findbugs warnings appear to be unrelated.

I'm wondering about the change in the removeApplicationFromRenewal method or 
remove.  If a sub-job completes, won't we remove the token from the allTokens 
map before the launcher job has completed?  A subsequent sub-job that 
requests token cancelation can then put the token back in the map and cause 
the token to be canceled when that sub-job finishes.  I think we need to repeat 
the logic from the original code before YARN-2704 here, i.e., only remove the 
token if the application ID matches.  That way the launcher job's token will 
remain _the_ token in that collection until the launcher job completes.
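
A rough sketch of the "only remove if the application ID matches" check, 
to make the suggestion concrete.  TokenEntry, ownerAppId, addToken and 
removeApplication are hypothetical names for illustration, not the ones in 
the patch or in DelegationTokenRenewer, and tokens and app IDs are plain 
strings here:

```java
import java.util.HashMap;
import java.util.Map;

class OwnerOnlyRemoval {
    // Hypothetical stand-in for DelegationTokenToRenew: remembers which
    // application first registered the token.
    static final class TokenEntry {
        final String token;
        final String ownerAppId;

        TokenEntry(String token, String ownerAppId) {
            this.token = token;
            this.ownerAppId = ownerAppId;
        }
    }

    private final Map<String, TokenEntry> allTokens = new HashMap<>();

    // The first application to submit a token stays its owner; later
    // submissions of the same token leave the existing entry alone.
    void addToken(String token, String appId) {
        allTokens.putIfAbsent(token, new TokenEntry(token, appId));
    }

    // Drop the token only when the finishing application is the owner.
    // Returns true if the entry was removed.
    boolean removeApplication(String token, String finishedAppId) {
        TokenEntry entry = allTokens.get(token);
        if (entry != null && entry.ownerAppId.equals(finishedAppId)) {
            allTokens.remove(token);
            return true;
        }
        return false;
    }

    boolean isTracked(String token) {
        return allTokens.containsKey(token);
    }
}
```

With this shape, a sub-job finishing leaves the launcher's entry in place, 
and the entry only goes away when the launcher itself finishes.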

This comment doesn't match the code.  As written, if any job sharing the 
token wants to cancel at the end, then we will cancel at the end:

          // If any of the jobs sharing the same token set shouldCancelAtEnd
          // to true, we should not cancel the token.
          if (evt.shouldCancelAtEnd) {
            dttr.shouldCancelAtEnd = evt.shouldCancelAtEnd;

I think the logic and comment should be that if any job doesn't want to cancel 
then we won't cancel.  The code seems to be trying to do the opposite, so I'm 
not sure how the unit test is passing.  Maybe I'm missing something.
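
For illustration, a minimal sketch of the semantics I'd expect: once any job 
sharing the token asks not to cancel, the flag stays false.  TokenState and 
merge are hypothetical names, not the patch's, and the initial value is 
assumed to come from the first job that registered the token:

```java
class CancelFlagMerge {
    // Hypothetical stand-in for the per-token renewer state.
    static final class TokenState {
        boolean shouldCancelAtEnd;

        TokenState(boolean initialShouldCancelAtEnd) {
            this.shouldCancelAtEnd = initialShouldCancelAtEnd;
        }
    }

    // If any job sharing the token sets shouldCancelAtEnd to false,
    // the token must not be canceled; a later "true" cannot undo that.
    static void merge(TokenState state, boolean eventShouldCancelAtEnd) {
        if (!eventShouldCancelAtEnd) {
            state.shouldCancelAtEnd = false;
        }
    }
}
```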

The info log message added in handleAppSubmitEvent is also misleading, as it 
says we are setting shouldCancelAtEnd to whatever the event said, when in 
reality we only set it sometimes.  It probably needs to move inside the 
conditional.

I wonder if we should be using a Set instead of a Map to track these tokens.  
Adding an already existing DelegationTokenToRenew to a set will not change the 
one already there, but with the map a sub-job can clobber the 
DelegationTokenToRenew that's already there with its own when it does the 
allTokens.put(dtr.token, dtr).
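
A small standalone sketch of the difference.  Entry is a hypothetical 
stand-in for DelegationTokenToRenew, and it assumes equals/hashCode are 
based on the token alone, which is what makes the set treat the sub-job's 
entry as a duplicate:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SetVersusMap {
    // Hypothetical stand-in for DelegationTokenToRenew. Equality is
    // assumed to depend only on the token, not on the application.
    static final class Entry {
        final String token;
        final String appId;

        Entry(String token, String appId) {
            this.token = token;
            this.appId = appId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Entry && token.equals(((Entry) o).token);
        }

        @Override
        public int hashCode() {
            return token.hashCode();
        }
    }

    // Map.put replaces the value, so the sub-job's entry wins.
    static String ownerAfterMapPut() {
        Map<String, Entry> allTokens = new HashMap<>();
        Entry launcher = new Entry("t1", "launcher");
        Entry sub = new Entry("t1", "sub1");
        allTokens.put(launcher.token, launcher);
        allTokens.put(sub.token, sub);
        return allTokens.get("t1").appId;
    }

    // Set.add leaves the set unchanged when an equal element is
    // already present, so the launcher's entry survives.
    static String ownerAfterSetAdd() {
        Set<Entry> allTokens = new HashSet<>();
        allTokens.add(new Entry("t1", "launcher"));
        allTokens.add(new Entry("t1", "sub1"));
        return allTokens.iterator().next().appId;
    }
}
```

Here ownerAfterMapPut() yields "sub1" while ownerAfterSetAdd() yields 
"launcher".  Map.putIfAbsent would give the same first-wins behavior if the 
map needs to stay for lookups.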

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-2964.1.patch
> The RM used to globally track the unique set of tokens for all apps.  It 
> remembered the first job that was submitted with the token.  The first job 
> controlled the cancellation of the token.  This prevented completion of 
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
> notion of the first/main job.  This results in sub-jobs canceling tokens and 
> failing the main job and other sub-jobs.  It also appears to schedule 
> multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 
> min (NM liveness interval) after log aggregation completes.  The result is 
> that an oozie job, e.g. pig, that launches many sub-jobs over time will fail 
> if any sub-job is launched >10 min after any sub-job completes.  If all other 
> sub-jobs complete within that 10 min window, then the issue goes unnoticed.
