[
https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253502#comment-14253502
]
Hudson commented on YARN-2964:
------------------------------
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #47 (See
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/47/])
YARN-2964. RM prematurely cancels tokens for jobs that submit jobs (oozie).
Contributed by Jian He (jlowe: rev 0402bada1989258ecbfdc437cb339322a1f55a97)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java
> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>
> Key: YARN-2964
> URL: https://issues.apache.org/jira/browse/YARN-2964
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Daryn Sharp
> Assignee: Jian He
> Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: YARN-2964.1.patch, YARN-2964.2.patch, YARN-2964.3.patch
>
>
> The RM used to globally track the unique set of tokens for all apps. It
> remembered the first job that was submitted with the token. The first job
> controlled the cancellation of the token. This prevented completion of
> sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no
> notion of the first/main job. This results in sub-jobs canceling tokens and
> failing the main job and other sub-jobs. It also appears to schedule
> multiple redundant renewals.
> The issue is not immediately obvious because the RM cancels tokens ~10
> min (the NM liveness interval) after log aggregation completes. As a result,
> an oozie job (e.g. pig) that launches many sub-jobs over time will fail if
> any sub-job is launched >10 min after any other sub-job completes. If all
> other sub-jobs complete within that 10 min window, the issue goes unnoticed.
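
The failure mode described above can be illustrated with a minimal sketch.
This is not the committed patch and not the real DelegationTokenRenewer API;
the class TokenTrackerSketch and its methods are hypothetical, and tokens and
application ids are reduced to plain strings. It assumes one possible remedy:
track which applications still reference a shared token, and cancel the token
only when the last of them finishes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: a token shared by a main job and its
// sub-jobs is cancelled when the LAST referring app finishes, not the first.
class TokenTrackerSketch {
    // token -> ids of apps that were submitted with it and are still running
    private final Map<String, Set<String>> appsByToken = new HashMap<>();
    // tokens that have been cancelled (stands in for a real cancel call)
    private final Set<String> cancelled = new HashSet<>();

    // Record that an app was submitted with the given token.
    void appSubmitted(String appId, String token) {
        appsByToken.computeIfAbsent(token, k -> new HashSet<>()).add(appId);
    }

    // Record that an app finished. Returns true only if this caused the
    // token to be cancelled, i.e. no other app still references it.
    boolean appFinished(String appId, String token) {
        Set<String> apps = appsByToken.get(token);
        if (apps == null) {
            return false; // unknown or already cancelled token
        }
        apps.remove(appId);
        if (apps.isEmpty()) {
            appsByToken.remove(token);
            cancelled.add(token);
            return true;
        }
        return false;
    }

    boolean isCancelled(String token) {
        return cancelled.contains(token);
    }
}
```

With this bookkeeping, a sub-job that finishes while the main job is still
running leaves the token alone. Under purely per-app tracking, each app would
cancel its own copy of the token on completion, which is the behavior the
description reports.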
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)