Jian He commented on YARN-2964:

bq. we did see an issue recently where a launched job that took over 24 hours 
would cause the launcher to fail with a delegation token issue because the 
token expired;
This is because the token is removed from the RM DelegationTokenRenewer even 
though the flag is set to false, so the RM no longer renews the token. This 
causes the oozie job to fail after 24 hrs, which should be a pre-existing 
issue. I'm working on a patch so that the behavior is no worse than before. 
The patch is based on the assumption that the launcher job waits for all 
actions to complete. 

In addition, I think it may make sense for oozie to propagate this flag to 
the other actions as well. Alternatively, we could introduce an application 
group id that identifies a group of applications, as in the oozie case, tie 
the token lifetime to the group, and drop this flag completely. 
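
To make the group idea concrete, here is a rough sketch of tying a token's 
lifetime to the set of applications that share it. It is only an illustration: 
GroupTokenTracker and its methods are hypothetical names, not the existing 
DelegationTokenRenewer code, and tokens and apps are reduced to plain strings. 
It assumes, as above, that the launcher stays registered until all of its 
actions complete.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch: a token is cancelled only when the last application
 * that references it has finished. Not the actual RM implementation.
 */
final class GroupTokenTracker {
    // token id -> ids of the applications that still reference the token
    private final Map<String, Set<String>> appsByToken = new HashMap<>();
    // token ids that have already been cancelled
    private final Set<String> cancelled = new HashSet<>();

    /** Registers an application (launcher or sub-job) as a user of the token. */
    void appSubmitted(String appId, String tokenId) {
        if (cancelled.contains(tokenId)) {
            throw new IllegalStateException("token already cancelled: " + tokenId);
        }
        appsByToken.computeIfAbsent(tokenId, k -> new HashSet<>()).add(appId);
    }

    /**
     * Removes the application from the token's group.
     * Returns true only if this was the last application and the token
     * was cancelled as a result.
     */
    boolean appFinished(String appId, String tokenId) {
        Set<String> apps = appsByToken.get(tokenId);
        if (apps == null) {
            return false; // unknown or already cancelled token
        }
        apps.remove(appId);
        if (!apps.isEmpty()) {
            return false; // other apps in the group still need the token
        }
        appsByToken.remove(tokenId);
        cancelled.add(tokenId);
        return true;
    }

    boolean isCancelled(String tokenId) {
        return cancelled.contains(tokenId);
    }
}
```

With this shape, a sub-job finishing while the launcher is still registered 
leaves the token alone, and the token is cancelled only once the launcher 
itself finishes.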

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
> The RM used to globally track the unique set of tokens for all apps.  It 
> remembered the first job that was submitted with the token.  The first job 
> controlled the cancellation of the token.  This prevented completion of 
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
> notion of the first/main job.  This results in sub-jobs canceling tokens and 
> failing the main job and other sub-jobs.  It also appears to schedule 
> multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 
> min (NM liveliness interval) after log aggregation completes.  The result is 
> an oozie job, e.g. pig, that will launch many sub-jobs over time will fail if 
> any sub-jobs are launched >10 min after any sub-job completes.  If all other 
> sub-jobs complete within that 10 min window, then the issue goes unnoticed.
