[ 
https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247444#comment-14247444
 ] 

Vinod Kumar Vavilapalli commented on YARN-2964:
-----------------------------------------------

I checked the code and doubt there is a bug.

bq. The first job controlled the cancellation of the token.
Correct.

bq. This prevented completion of sub-jobs from canceling tokens used by the 
main job.
Only partially true. The more common case to avoid was the completion of the 
launcher job itself canceling tokens that the sub-jobs still needed.

bq. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no 
notion of the first/main job. This results in sub-jobs canceling tokens and 
failing the main job and other sub-jobs.
AFAIR, this code never had the concept of a first job. An app submits tokens, 
and there was a flat list of tokens. Every time an app finishes, the RM checks 
whether the CancelTokensWhenComplete flag is set, and skips the cancellation 
for that app if the flag is set. The token then expires after 7 days. This 
continues to be the case even after YARN-2704.
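
To make the flow above concrete, here is a minimal sketch of it. This is not the actual RM code: the class, method, and field names are hypothetical, and it assumes the flag is a per-app boolean that an app sets to false to opt out of cancellation when it finishes. Token expiry is not modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified model of the token handling described above. */
class TokenTrackerSketch {

    /** One entry in the flat list: a token plus the submitting app's flag. */
    static final class TrackedToken {
        final String appId;
        final String tokenId;
        // Assumption: false means "do not cancel when this app completes".
        final boolean cancelTokensWhenComplete;

        TrackedToken(String appId, String tokenId, boolean cancelTokensWhenComplete) {
            this.appId = appId;
            this.tokenId = tokenId;
            this.cancelTokensWhenComplete = cancelTokensWhenComplete;
        }
    }

    // Flat list of all tracked tokens; there is no notion of a first/main job.
    private final List<TrackedToken> tokens = new ArrayList<>();
    // Records which tokens were canceled, for inspection only.
    private final List<String> canceled = new ArrayList<>();

    void submit(String appId, String tokenId, boolean cancelTokensWhenComplete) {
        tokens.add(new TrackedToken(appId, tokenId, cancelTokensWhenComplete));
    }

    /** Called when an app finishes: drop its entries, canceling only if the flag allows. */
    void appFinished(String appId) {
        Iterator<TrackedToken> it = tokens.iterator();
        while (it.hasNext()) {
            TrackedToken t = it.next();
            if (!t.appId.equals(appId)) {
                continue;
            }
            it.remove();
            if (t.cancelTokensWhenComplete) {
                canceled.add(t.tokenId);
            }
            // Otherwise the token is left alone and expires on its own later.
        }
    }

    List<String> canceledTokens() {
        return canceled;
    }
}
```

In this sketch, a launcher that submits with the flag set to false leaves its token untouched when it finishes, while an app that submits with the flag set to true has its token canceled on completion.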

bq. It also appears to schedule multiple redundant renewals.
Specific references?

bq. If all other sub-jobs complete within that 10 min window, then the issue 
goes unnoticed.
I doubt this issue happens at all. Are you seeing it on a cluster, or is it a 
theory? In any case, [~jianhe], can we write a test case that proves or 
disproves this?


> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Priority: Blocker
>
> The RM used to globally track the unique set of tokens for all apps.  It 
> remembered the first job that was submitted with the token.  The first job 
> controlled the cancellation of the token.  This prevented completion of 
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
> notion of the first/main job.  This results in sub-jobs canceling tokens and 
> failing the main job and other sub-jobs.  It also appears to schedule 
> multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 
> min (NM liveliness interval) after log aggregation completes.  The result is 
> that an oozie job, e.g. pig, that launches many sub-jobs over time will fail 
> if any sub-jobs are launched >10 min after any sub-job completes.  If all 
> other sub-jobs complete within that 10 min window, then the issue goes 
> unnoticed.



