Robert Kanter commented on YARN-2964:

[~kasha] is correct.  The launcher job waits around for all action types that 
typically submit other MR jobs (Pig, Sqoop, Hive, etc.) except for the MapReduce 
action, which finishes immediately after submitting the "real" MR job.  

I just checked, and in the MR launcher, Oozie sets 
{{mapreduce.job.complete.cancel.delegation.tokens}} to {{true}} and in the 
other launchers, Oozie sets it to {{false}}.  Oozie doesn't touch this 
property in any "real" launched MR jobs, so they'll use the default, which I'm 
guessing is {{true}}.  Though thinking about this now, it seems like these are 
backwards, so I'm not sure how that's working right....
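
To make the "backwards" concern concrete, here is a minimal sketch that tabulates the settings described above. It is only a model, not Oozie code: the default of {{true}} is the guess from this comment, and the job labels and the "another token user is still running" column are illustrative assumptions.

```python
CANCEL_PROP = "mapreduce.job.complete.cancel.delegation.tokens"

# Assumption: the comment above only guesses that the default is true.
ASSUMED_DEFAULT = True

# (job kind, value Oozie sets explicitly or None if untouched,
#  whether another job sharing the token is typically still running
#  when this job completes). Labels are illustrative only.
JOBS = [
    ("MapReduce launcher", True, True),     # exits right after submitting the real job
    ("Pig/Sqoop/Hive launcher", False, False),  # waits for all of its sub-jobs
    ("MR job started by a waiting launcher", None, True),  # launcher keeps running
]


def effective_value(explicit):
    """Value of CANCEL_PROP the job ends up with in this model."""
    return ASSUMED_DEFAULT if explicit is None else explicit


def suspicious_jobs(jobs):
    """Job kinds that cancel tokens on completion while another
    user of the same token may still be running."""
    return [name for name, explicit, others_running in jobs
            if effective_value(explicit) and others_running]


print(CANCEL_PROP, "looks risky for:", suspicious_jobs(JOBS))
```

In this model the two jobs that finish early are exactly the ones that end up with {{true}}, which is one way to read why the settings look backwards.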

On a related note, we did see an issue recently where a launched job that took 
over 24 hours would cause the launcher to fail with a delegation token issue 
because the token expired, even with the property explicitly set correctly.  
The problem was that {{yarn.resourcemanager.delegation.token.renew-interval}} 
was set to 24 hours (the default) and if you don't renew (or use?) a delegation 
token at least every 24 hours, then it automatically expires.  [~daryn], 
perhaps in the original issue this was set to 10 minutes?  I haven't had a 
chance to look into this, but the fix for this particular issue would be to 
have the launcher job renew the token at some interval.
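
The expiry behavior described above can be sketched with a small timing model. This is a simplification under stated assumptions, not RM code: the token is taken to expire exactly one renew interval after it was issued or last renewed, any maximum token lifetime is ignored, and {{token_survives}} and its parameters are hypothetical names.

```python
RENEW_INTERVAL_H = 24  # value reported in the comment above


def token_survives(job_hours, renew_every_h=None,
                   renew_interval_h=RENEW_INTERVAL_H):
    """Return True if the token stays valid for the whole job.

    Simplified model: the token expires renew_interval_h hours after
    it was issued or last renewed. renew_every_h=None means the
    launcher never renews.
    """
    if renew_every_h is None:
        return job_hours <= renew_interval_h
    if renew_every_h <= 0:
        raise ValueError("renew_every_h must be positive")

    expiry = renew_interval_h
    t = renew_every_h
    while t < job_hours:
        if t >= expiry:
            return False  # renewal attempt arrives after expiry
        expiry = t + renew_interval_h
        t += renew_every_h
    return job_hours <= expiry


print(token_survives(30))       # 30 h job, no renewal
print(token_survives(30, 12))   # 30 h job, renew every 12 h
```

Under this model a 30-hour job with no renewal outlives its token, while renewing at any interval shorter than 24 hours keeps it valid.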

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
> The RM used to globally track the unique set of tokens for all apps.  It 
> remembered the first job that was submitted with the token.  The first job 
> controlled the cancellation of the token.  This prevented completion of 
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
> notion of the first/main job.  This results in sub-jobs canceling tokens and 
> failing the main job and other sub-jobs.  It also appears to schedule 
> multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 
> min (NM liveness interval) after log aggregation completes.  The result is 
> that an Oozie job (e.g., Pig) that launches many sub-jobs over time will fail 
> if any sub-job is launched >10 min after any sub-job completes.  If all other 
> sub-jobs complete within that 10 min window, then the issue goes unnoticed.
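
The timing window in the description can be illustrated with a minimal sketch. It assumes a single shared token, treats each sub-job's finish time as the point where its log aggregation completes, and uses hypothetical names ({{failed_launches}}, the tuple layout); it is a model of the described symptom, not of the RM implementation.

```python
CANCEL_DELAY_MIN = 10  # approximate delay from the issue description


def failed_launches(sub_jobs, cancel_delay=CANCEL_DELAY_MIN):
    """Return names of sub-jobs launched after the shared token is canceled.

    sub_jobs: list of (name, launch_minute, finish_minute) tuples, where
    finish_minute is taken to include log aggregation.
    """
    if not sub_jobs:
        return []
    # The token is canceled cancel_delay minutes after the first completion.
    cancel_at = min(finish for _, _, finish in sub_jobs) + cancel_delay
    return [name for name, launch, _ in sub_jobs if launch > cancel_at]


# Sub-job "c" starts 15 minutes after "a" finishes, past the 10 minute window.
print(failed_launches([("a", 0, 5), ("b", 3, 12), ("c", 20, 30)]))
# All launches fall inside the window, so the problem goes unnoticed.
print(failed_launches([("a", 0, 5), ("b", 3, 12), ("c", 14, 30)]))
```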
