[jira] [Updated] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3535:
---------------------------
    Fix Version/s: 2.6.4

> Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
> ---------------------------------------------------------------------------------------------------
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler, fairscheduler, resourcemanager
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM,
> and the job then hung.
> AM logs are attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057536#comment-15057536 ]

zhihai xu commented on YARN-3535:
---------------------------------

Yes, this issue exists in 2.6.x, I just committed this patch to branch-2.6.
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057608#comment-15057608 ]

zhihai xu commented on YARN-3535:
---------------------------------

You are welcome! I think this will be a very critical fix for 2.6.4 release.
[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
[ https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057613#comment-15057613 ]

zhihai xu commented on YARN-4440:
---------------------------------

Committed it to trunk and branch-2. Thanks [~linyiqun] for the contribution!

> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -----------------------------------------------------------------------------
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.1
> Reporter: Lin Yiqun
> Assignee: Lin Yiqun
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, YARN-4440.003.patch
>
>
> There seems to be a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} method:
> {code}
> // default level is NODE_LOCAL
> if (!allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> The first invocation of this method does not record a time in
> lastScheduledContainer, so the next invocation executes this code:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> waitTime is then measured from the FsApp start time. Because the start time is
> earlier than currentTimeMs, waitTime can easily exceed the delay time, and the
> allowed locality level is degraded. We should therefore record an initial time
> for the priority, so that the wait is not compared against the FsApp start time
> and allowedLocalityLevel is not degraded prematurely. This problem hurts small
> jobs the most.
> YARN-4399 also discusses some locality-related problems.
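To make the proposed change concrete, here is a minimal sketch of the idea, not the committed patch. `LocalityDelaySketch`, its integer priorities, and the delay parameters are hypothetical stand-ins for the real FSAppAttempt state; the only point is that the first call for a priority records `currentTimeMs`, so the next call measures the wait from that moment instead of from the app start time. Resetting the timer after a level is relaxed is also an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the locality bookkeeping described above; not the real FSAppAttempt. */
class LocalityDelaySketch {
  enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

  private final Map<Integer, NodeType> allowedLocalityLevel = new HashMap<>();
  private final Map<Integer, Long> lastScheduledContainer = new HashMap<>();
  private final long startTimeMs;

  LocalityDelaySketch(long startTimeMs) {
    this.startTimeMs = startTimeMs;
  }

  NodeType getAllowedLocalityLevelByTime(int priority, long nodeDelayMs,
      long rackDelayMs, long currentTimeMs) {
    // First call for this priority: default to NODE_LOCAL and record the
    // current time, so later waits are not measured from the app start time.
    if (!allowedLocalityLevel.containsKey(priority)) {
      allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
      lastScheduledContainer.put(priority, currentTimeMs);
      return NodeType.NODE_LOCAL;
    }
    NodeType allowed = allowedLocalityLevel.get(priority);
    if (allowed == NodeType.OFF_SWITCH) {
      return allowed;
    }
    // Check waiting time, as in the quoted snippet.
    long waitTime = currentTimeMs;
    if (lastScheduledContainer.containsKey(priority)) {
      waitTime -= lastScheduledContainer.get(priority);
    } else {
      waitTime -= startTimeMs;
    }
    long threshold = (allowed == NodeType.NODE_LOCAL) ? nodeDelayMs : rackDelayMs;
    if (waitTime > threshold) {
      // Relax one level and restart the timer (assumed behaviour for this sketch).
      allowed = (allowed == NodeType.NODE_LOCAL) ? NodeType.RACK_LOCAL : NodeType.OFF_SWITCH;
      allowedLocalityLevel.put(priority, allowed);
      lastScheduledContainer.put(priority, currentTimeMs);
    }
    return allowed;
  }

  public static void main(String[] args) {
    LocalityDelaySketch app = new LocalityDelaySketch(0L);
    System.out.println(app.getAllowedLocalityLevelByTime(1, 5000L, 5000L, 10_000L));
    System.out.println(app.getAllowedLocalityLevelByTime(1, 5000L, 5000L, 11_000L));
  }
}
```

In this sketch an app that started at time 0 and is first offered a node at 10 s stays NODE_LOCAL on the following call at 11 s. Without the recorded initial time, the second call would compute a wait of 11 s against the start time and degrade immediately.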
[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
zhihai xu created YARN-4458:
----------------------------

Summary: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
Key: YARN-4458
URL: https://issues.apache.org/jira/browse/YARN-4458
Project: Hadoop YARN
Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Release Note: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. (was: Compilation error at branch-2.7 due to {{getNodeLabelExpression}} not defined in NMContainerStatusPBImpl.)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Attachment: YARN-4458.000.patch
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Attachment: (was: YARN-4458.000.patch)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Attachment: YARN-4458.branch-2.7.patch
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Description: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Description: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. This issue only happens for branch-2.7. (was: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4458:
---------------------------
    Release Note: (was: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.)
[jira] [Commented] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058256#comment-15058256 ]

zhihai xu commented on YARN-4458:
---------------------------------

Thanks [~jlowe]! yes, It makes sense, which will make cherry-pick easier.
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3857:
---------------------------
    Affects Version/s: 2.6.2

> Memory leak in ResourceManager with SIMPLE mode
> -----------------------------------------------
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.0, 2.6.2
> Reporter: mujunchao
> Assignee: mujunchao
> Priority: Critical
> Labels: patch
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
> We register the ClientTokenMasterKey so that a client does not hold an invalid
> ClientToken after the RM restarts. In SIMPLE mode we still register the Pair,
> but we never remove it from the HashMap, because unregister only runs in
> secure mode. The result is a memory leak.
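The leak pattern described in this issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the ResourceManager code: `MasterKeyRegistrySketch`, the `String` attempt ids, and both `unregister` variants are hypothetical names introduced only to show an unconditional register paired with a removal that is guarded by the security mode.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry illustrating the leak pattern; not the actual ResourceManager class. */
class MasterKeyRegistrySketch {
  private final Map<String, byte[]> keysByAttempt = new HashMap<>();
  private final boolean securityEnabled;

  MasterKeyRegistrySketch(boolean securityEnabled) {
    this.securityEnabled = securityEnabled;
  }

  /** Registration happens in every mode. */
  void register(String attemptId, byte[] masterKey) {
    keysByAttempt.put(attemptId, masterKey);
  }

  /** Leaky variant: mirrors the reported behaviour, removal only in secure mode. */
  void unregisterLeaky(String attemptId) {
    if (securityEnabled) {
      keysByAttempt.remove(attemptId);
    }
  }

  /** One possible remedy: remove unconditionally, matching the unconditional register. */
  void unregister(String attemptId) {
    keysByAttempt.remove(attemptId);
  }

  int size() {
    return keysByAttempt.size();
  }

  public static void main(String[] args) {
    MasterKeyRegistrySketch simpleMode = new MasterKeyRegistrySketch(false);
    simpleMode.register("attempt_1", new byte[] {1, 2, 3});
    simpleMode.unregisterLeaky("attempt_1");
    System.out.println("entries after leaky unregister: " + simpleMode.size());
    simpleMode.unregister("attempt_1");
    System.out.println("entries after unconditional unregister: " + simpleMode.size());
  }
}
```

In SIMPLE mode the leaky variant leaves every finished attempt in the map, so the map grows with the number of attempts the RM has ever seen.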
[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.
[ https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058263#comment-15058263 ]

zhihai xu commented on YARN-4439:
---------------------------------

Hi [~jianhe], Could you revert the old patch and create a new patch for branch-2.7 to fix the compilation error?

> Clarify NMContainerStatus#toString method.
> ------------------------------------------
>
> Key: YARN-4439
> URL: https://issues.apache.org/jira/browse/YARN-4439
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jian He
> Assignee: Jian He
> Fix For: 2.7.3
>
> Attachments: YARN-4439.1.patch, YARN-4439.2.patch
[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.
[ https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058460#comment-15058460 ]

zhihai xu commented on YARN-4439:
---------------------------------

Good Catch [~jlowe]! Will clean it up! thanks.

> Attachments: YARN-4439.1.patch, YARN-4439.2.patch, YARN-4439.appendum-2.7.patch
[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
[ https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065066#comment-15065066 ]

zhihai xu commented on YARN-4440:
---------------------------------

yes, thanks [~leftnoteasy] for committing it to branch-2.8!

> Fix For: 2.8.0
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3446:
---------------------------
    Attachment: YARN-3446.004.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -------------------------------------------------------------------------
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> The FairScheduler headroom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer
> preemption calculation includes blacklisted nodes. This makes jobs hang
> forever: the ResourceManager does not assign any new containers on blacklisted
> nodes, but the availableResource that the AM gets from the RM includes the
> resources available on those nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.
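The headroom adjustment requested here can be illustrated with a small sketch. It assumes, purely for illustration, that headroom is the smaller of a queue limit and the memory available across usable nodes; `HeadroomSketch`, `headroomMb`, and the map of per-node available memory are hypothetical and do not correspond to FairScheduler classes or to the attached patches.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical headroom helper; illustrates excluding blacklisted nodes only. */
class HeadroomSketch {
  /**
   * Sums available memory over all nodes, skipping blacklisted ones, and caps
   * the result at the queue limit (an assumed headroom definition).
   */
  static long headroomMb(Map<String, Long> availableMbByNode,
      Set<String> blacklist, long queueLimitMb) {
    long usableMb = 0;
    for (Map.Entry<String, Long> node : availableMbByNode.entrySet()) {
      if (!blacklist.contains(node.getKey())) {
        usableMb += node.getValue();
      }
    }
    return Math.min(queueLimitMb, usableMb);
  }

  public static void main(String[] args) {
    Map<String, Long> available = new HashMap<>();
    available.put("node1", 4096L);
    available.put("node2", 2048L);
    Set<String> blacklist = new HashSet<>();
    blacklist.add("node2");
    System.out.println("headroom MB: " + headroomMb(available, blacklist, 8192L));
  }
}
```

If the only free memory sits on blacklisted nodes, this headroom drops to zero, which is the signal the AM needs in order to start preempting reducers instead of waiting for containers that will never be assigned.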
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082601#comment-15082601 ]

zhihai xu commented on YARN-3446:
---------------------------------

thanks for the review! Just updated the patch at YARN-3446.004.patch.
[jira] [Updated] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3697:
---------------------------
    Fix Version/s: 2.6.4

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --------------------------------------------------------------
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: ContinuousSchedulingThread sometimes cannot be shut down after stop.
> The reason is that the InterruptedException is swallowed by the catch block in
> continuousSchedulingAttempt:
> {code}
> try {
>   if (node != null && Resources.fitsIn(minimumAllocation,
>       node.getAvailableResource())) {
>     attemptScheduling(node);
>   }
> } catch (Throwable ex) {
>   LOG.error("Error while attempting scheduling for node " + node +
>       ": " + ex.toString(), ex);
> }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted
> java.lang.InterruptedException
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 available= used=: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
>   at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContai
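The failure mode reported in this issue, a broad catch that logs a wrapped InterruptedException and lets the loop continue, can be shown with a small sketch. It is a general illustration of interrupt-aware error handling, not the attached patch: `InterruptAwareLoopSketch` and `isInterrupt` are hypothetical names, and a plain `RuntimeException` stands in for YarnRuntimeException.

```java
/** Hypothetical worker loop; illustrates the pattern only, not the FairScheduler code. */
class InterruptAwareLoopSketch implements Runnable {
  private final Runnable attempt;
  private int attempts;

  InterruptAwareLoopSketch(Runnable attempt) {
    this.attempt = attempt;
  }

  int attempts() {
    return attempts;
  }

  /** True if the throwable is, or directly wraps, an InterruptedException. */
  static boolean isInterrupt(Throwable ex) {
    return ex instanceof InterruptedException
        || ex.getCause() instanceof InterruptedException;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      attempts++;
      try {
        attempt.run();
      } catch (Throwable ex) {
        if (isInterrupt(ex)) {
          // Restore the interrupt status and leave the loop instead of
          // logging the exception and continuing.
          Thread.currentThread().interrupt();
          return;
        }
        System.err.println("Error while attempting scheduling: " + ex);
      }
    }
  }

  public static void main(String[] args) {
    InterruptAwareLoopSketch loop = new InterruptAwareLoopSketch(() -> {
      throw new RuntimeException(new InterruptedException());
    });
    loop.run();
    System.out.println("attempts=" + loop.attempts()
        + ", interrupted=" + Thread.interrupted());
  }
}
```

Without the `isInterrupt` check, the wrapped exception would only be logged. The interrupt status has already been cleared by the time the exception is thrown, so the `while` condition stays true and the thread keeps running after stop.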
[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082641#comment-15082641 ]

zhihai xu commented on YARN-3697:
---------------------------------

[~djp], yes, I just committed it to branch-2.6. thanks
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097605#comment-15097605 ]

zhihai xu commented on YARN-3446:
---------------------------------

Thanks for the review [~kasha]! That is a good suggestion. I attached a new patch YARN-3446.005.patch, which addressed your comments. Please review it.
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3446:
---------------------------
    Attachment: YARN-3446.005.patch
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098347#comment-15098347 ]

zhihai xu commented on YARN-3446:
---------------------------------

The test failures for TestClientRMTokens and TestAMAuthorization are not related to the patch. Both tests pass in my local build.
[jira] [Commented] (YARN-3446) FairScheduler headroom calculation should exclude nodes in the blacklist
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098617#comment-15098617 ] zhihai xu commented on YARN-3446: - [~kasha], thanks for the review and committing the patch! > FairScheduler headroom calculation should exclude nodes in the blacklist > > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.9.0 > > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, > YARN-3446.005.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
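The idea in the issue description can be sketched as follows. This is only an illustration under simplifying assumptions, not the committed patch: resources are reduced to a single memory figure in MB, and the class and method names (`HeadroomSketch`, `usableAvailableMb`, `headroomMb`) are hypothetical and do not exist in Hadoop.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: headroom that ignores capacity on blacklisted nodes.
public class HeadroomSketch {

    // Sum of available memory (MB) over nodes the application has not blacklisted.
    static long usableAvailableMb(Map<String, Long> availableMbByNode, Set<String> blacklist) {
        long total = 0;
        for (Map.Entry<String, Long> entry : availableMbByNode.entrySet()) {
            if (!blacklist.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    // Headroom is capped by the queue-level limit and by what usable nodes can still offer.
    static long headroomMb(long queueHeadroomMb, Map<String, Long> availableMbByNode,
                           Set<String> blacklist) {
        long usable = usableAvailableMb(availableMbByNode, blacklist);
        return Math.max(0L, Math.min(queueHeadroomMb, usable));
    }

    public static void main(String[] args) {
        Map<String, Long> available = new HashMap<>();
        available.put("node01", 4096L);
        available.put("node02", 8192L);

        Set<String> blacklist = new HashSet<>();
        blacklist.add("node02");

        // Without the blacklist the AM would be told 12288 MB is available.
        System.out.println(headroomMb(16384L, available, new HashSet<String>()));
        // With node02 blacklisted only node01's 4096 MB is actually reachable.
        System.out.println(headroomMb(16384L, available, blacklist));
    }
}
```

With the smaller figure reported to the AM, the reducer preemption logic would see that the pending maps cannot be satisfied and could act on it, instead of waiting on capacity that the RM will never assign.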
[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby
[ https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118816#comment-15118816 ] zhihai xu commented on YARN-4646: - Is this issue fixed in MAPREDUCE-6439? They have the same stack trace. > AMRMClient crashed when RM transition from active to standby > > > Key: YARN-4646 > URL: https://issues.apache.org/jira/browse/YARN-4646 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee > > when RM transition to standby, ApplicationMasterService#allocate() is > interrupted and the exception is passed to AM. > the following is the exception msg: > {quote} > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) > at > 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:258) > ... 11 more > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy35.allocate(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:274) > at > org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:237) > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448) > 
at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$
[jira] [Commented] (YARN-4502) Fix two AM containers get allocated when AM restart
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129324#comment-15129324 ] zhihai xu commented on YARN-4502: - +1 also. This patch also covers the case when a container receives an RMContainerEventType.EXPIRE event in state RMContainerState.ALLOCATED, which was not covered by YARN-3535. Based on the original suggestion by [~leftnoteasy], it looks like the implementation of {{AbstractYarnScheduler#getApplicationAttempt(ApplicationAttemptId applicationAttemptId)}} is also confusing. It always returns the current application attempt, even if the current application attempt doesn't match the given {{applicationAttemptId}}. In contrast, {{RMAppImpl#getRMAppAttempt(ApplicationAttemptId appAttemptId)}} always returns the matching {{RMAppAttempt}}. Should we fix it in a follow-up JIRA? > Fix two AM containers get allocated when AM restart > --- > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A 
RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
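The difference between the two lookup behaviours raised in the comment above can be illustrated with a small model. This is a hypothetical sketch, not the Hadoop code: attempt IDs are plain strings, and `AttemptLookupSketch`, `lenientLookup` and `strictLookup` are invented names used only to contrast "always return the current attempt" with "return an attempt only when the ID matches".

```java
// Hypothetical model of an application that tracks only its current attempt.
public class AttemptLookupSketch {

    private final String currentAttemptId;

    AttemptLookupSketch(String currentAttemptId) {
        this.currentAttemptId = currentAttemptId;
    }

    // Behaviour described as confusing: the requested ID is ignored
    // and the current attempt is returned regardless.
    String lenientLookup(String requestedAttemptId) {
        return currentAttemptId;
    }

    // Stricter alternative: return the current attempt only if the IDs match,
    // otherwise null so that a stale attempt ID is visible to the caller.
    String strictLookup(String requestedAttemptId) {
        return currentAttemptId.equals(requestedAttemptId) ? currentAttemptId : null;
    }

    public static void main(String[] args) {
        AttemptLookupSketch app = new AttemptLookupSketch("attempt_02");

        // An event that still refers to the first attempt arrives late.
        System.out.println(app.lenientLookup("attempt_01")); // the second attempt is returned
        System.out.println(app.strictLookup("attempt_01"));  // null, the mismatch is detected
    }
}
```

With the lenient form, an event addressed to an old attempt can silently be applied to the new one, which is the kind of confusion the comment points at.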
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ] zhihai xu commented on YARN-4728: - Thanks for reporting this issue [~Silnov]! It looks like this issue is caused by long timeouts at two levels. This issue is similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may work around this issue by changing the configuration values: "ipc.client.connect.max.retries.on.timeouts" (default is 45), "ipc.client.connect.timeout" (default is 20000ms) and "yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000ms). > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration remain default largely. > I run some job on the cluster(especially some job processing a lot of data) > every day. > Sometimes, I found my job remain the same progression for a very very long > time. So I have to kill the job mannually and re-submit it to the cluster. It > works well before(re-submit the job and it run to the end), but something go > wrong today. > After I re-submit the same task for 3 times, its running go deadlock(the > progression doesn't change for a long time, and each time has a different > progress value.e.g.33.01%,45.8%,73.21%). > I begin to check the web UI for the hadoop, then I find there are 98 map > suspend while all the running reduce task have consumed all the avaliable > memory. I stop the yarn and add configuration below into yarn-site.xml and > then restart the yarn. 
> yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (wanting the yarn to preempt the reduce task's resource to run suspending map > task) > After restart the yarn,I submit the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. > but the same result happen again!!(my job remain the same progress value for > a very very long time) > I check the web UI for the hadoop again,and find that the suspended map task > is newed with the previous note:"TaskAttempt killed because it ran on > unusable node node02:21349". > Then I check the resourcemanager's log and find some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good > which cause the RM don't receive the NM's heartbeat in time. > But I wonder that why the yarn framework can't preempt the running reduce > task's resource to run the suspend map task?(this cause the job remain the > same progress value for a very very long time:( ) > Any one can help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
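To see why the timeouts named in the comment on this issue add up to a long stall, a rough budget can be computed. This is a back-of-the-envelope sketch, not Hadoop code: it assumes the IPC connect budget is approximately retries multiplied by the per-attempt timeout (the exact attempt count depends on the client implementation), it takes 20,000 ms as the default for ipc.client.connect.timeout, and the class and method names are hypothetical. How the IPC budget and the NodeManager connect wait compound depends on the retry policy in use, so the two figures are reported separately.

```java
// Hypothetical helper for estimating how long a client may keep retrying a dead node.
public class ConnectTimeoutBudget {

    // Approximate time spent at the IPC level before a connect failure is reported.
    static long ipcConnectBudgetMs(int maxRetriesOnTimeouts, long connectTimeoutMs) {
        return (long) maxRetriesOnTimeouts * connectTimeoutMs;
    }

    // Converts milliseconds to whole minutes for display.
    static long toMinutes(long millis) {
        return millis / 60_000L;
    }

    public static void main(String[] args) {
        // Defaults: ipc.client.connect.max.retries.on.timeouts = 45,
        // ipc.client.connect.timeout = 20000 ms,
        // yarn.client.nodemanager-connect.max-wait-ms = 900000 ms.
        long ipcDefault = ipcConnectBudgetMs(45, 20_000L);
        long nmWaitDefault = 900_000L;
        System.out.println("IPC level: about " + toMinutes(ipcDefault) + " min");
        System.out.println("NM connect wait: " + toMinutes(nmWaitDefault) + " min");

        // Example of reduced values (illustrative numbers, not a recommendation).
        long ipcReduced = ipcConnectBudgetMs(3, 5_000L);
        System.out.println("IPC level with 3 retries of 5 s: " + ipcReduced + " ms");
    }
}
```

With the defaults, each level alone can account for roughly 15 minutes, which is consistent with a job that appears to make no progress for a long time after a node is lost.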
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170481#comment-15170481 ] zhihai xu commented on YARN-4728: - Yes, MAPREDUCE-6513 is possible, but YARN-1680 is more likely, because blacklisted nodes occur more easily in your environment than the MAPREDUCE-6513 condition, especially with mapreduce.job.reduce.slowstart.completedmaps=1. To tell whether it is MAPREDUCE-6513 or YARN-1680, check the log to see whether reduce tasks are preempted. If reduce tasks are preempted and map tasks still can't get resources, it is MAPREDUCE-6513/MAPREDUCE-6514. Otherwise, it is YARN-1680. Even if YARN-1680 is fixed so that preemption is triggered, MAPREDUCE-6513 can still happen. > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration remain default largely. > I run some job on the cluster(especially some job processing a lot of data) > every day. > Sometimes, I found my job remain the same progression for a very very long > time. So I have to kill the job mannually and re-submit it to the cluster. It > works well before(re-submit the job and it run to the end), but something go > wrong today. > After I re-submit the same task for 3 times, its running go deadlock(the > progression doesn't change for a long time, and each time has a different > progress value.e.g.33.01%,45.8%,73.21%). > I begin to check the web UI for the hadoop, then I find there are 98 map > suspend while all the running reduce task have consumed all the avaliable > memory. 
I stop the yarn and add configuration below into yarn-site.xml and > then restart the yarn. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (wanting the yarn to preempt the reduce task's resource to run suspending map > task) > After restart the yarn,I submit the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. > but the same result happen again!!(my job remain the same progress value for > a very very long time) > I check the web UI for the hadoop again,and find that the suspended map task > is newed with the previous note:"TaskAttempt killed because it ran on > unusable node node02:21349". > Then I check the resourcemanager's log and find some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good > which cause the RM don't receive the NM's heartbeat in time. > But I wonder that why the yarn framework can't preempt the running reduce > task's resource to run the suspend map task?(this cause the job remain the > same progress value for a very very long time:( ) > Any one can help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179066#comment-15179066 ] zhihai xu commented on YARN-4761: - Good finding [~sjlee0]! The same issue could also happen for the fair scheduler. We should decouple RMNode status from the fair scheduler as well. > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182514#comment-15182514 ] zhihai xu commented on YARN-4761: - +1 for the latest patch. The test failures are not related to the patch, and one test failure is the same as YARN-4306. Will commit the patch shortly. > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4761.01.patch, YARN-4761.02.patch > > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182647#comment-15182647 ] zhihai xu commented on YARN-4761: - I just committed it to trunk, branch-2, branch-2.8, branch-2.7 and branch-2.6. Thanks [~sjlee0] for the contribution and thanks [~rohithsharma] for the review! > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Fix For: 2.8.0, 2.7.3, 2.6.5 > > Attachments: YARN-4761.01.patch, YARN-4761.02.patch > > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248637#comment-15248637 ] zhihai xu commented on YARN-2910: - Linked YARN-2975 to this issue. It looks like we need both YARN-2910 and YARN-2975 to fix this issue completely. > FSLeafQueue can throw ConcurrentModificationException > - > > Key: YARN-2910 > URL: https://issues.apache.org/jira/browse/YARN-2910 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Labels: 2.6.1-candidate > Fix For: 2.7.0, 2.6.1 > > Attachments: FSLeafQueue_concurrent_exception.txt, > YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, > YARN-2910.4.patch, YARN-2910.5.patch, YARN-2910.6.patch, YARN-2910.7.patch, > YARN-2910.8.patch, YARN-2910.patch > > > The list that maintains the runnable and the non runnable apps are a standard > ArrayList but there is no guarantee that it will only be manipulated by one > thread in the system. This can lead to the following exception: > {noformat} > 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN > CONTACTING RM. 
> java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) > at java.util.ArrayList$Itr.next(ArrayList.java:831) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) > {noformat} > Full stack trace in the attached file. > We should guard against that by using a thread safe version from > java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
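The failure mode and the suggested guard from the description above can be reproduced in isolation. The sketch below is not the FSLeafQueue code: the class and method names are hypothetical, and a modification made during iteration in the same thread stands in for a second thread adding an app while resource usage is being summed, which keeps the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demonstration of fail-fast ArrayList iteration versus snapshot iteration.
public class CowListSketch {

    // Sums the list while appending to it mid-iteration.
    // Returns -1 if the iterator detects the concurrent modification.
    static int sumWhileAppending(List<Integer> usages) {
        int sum = 0;
        try {
            for (int usage : usages) {
                sum += usage;
                if (usage == 1) {
                    usages.add(100); // stands in for another thread adding an app
                }
            }
        } catch (ConcurrentModificationException e) {
            return -1;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> plain = new ArrayList<>(Arrays.asList(1, 2, 3));
        List<Integer> copyOnWrite = new CopyOnWriteArrayList<>(Arrays.asList(1, 2, 3));

        // ArrayList iterators are fail-fast, so the sum is aborted.
        System.out.println(sumWhileAppending(plain));
        // CopyOnWriteArrayList iterates over a snapshot, so the sum completes
        // over the elements that were present when iteration started.
        System.out.println(sumWhileAppending(copyOnWrite));
    }
}
```

The trade-off is that every write to a `CopyOnWriteArrayList` copies the backing array, so it suits lists that are read far more often than they are modified.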
[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.addendum.patch > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251016#comment-15251016 ] zhihai xu commented on YARN-1458: - Hi [~dwatzke], thanks for reporting this issue. I double-checked the code and found one corner case which can cause this issue. Hopefully this is the only case which isn't handled. The corner case is when the current memory demand for the app overflows an integer. If that happens, the weight will become [NaN|https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#log1p(double)] because the current memory demand is a negative value. {code} weight = Math.log1p(app.getDemand().getMemory()) / Math.log(2); {code} {{getFairShareIfFixed}} will treat a NaN weight the same as a positive weight. {{computeShare}} will always return 0 if the weight is NaN because {{share}} is NaN and {{(int)NaN}} is 0. I attached an addendum patch, YARN-1458.addendum.patch. Could you verify whether this patch fixes your issue? Thanks. > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
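The corner case in the comment on YARN-1458 can be reproduced outside Hadoop. The following is a minimal sketch, not scheduler code: the class {{NanWeightDemo}} and its methods are hypothetical names, and the {{weight <= 0}} check is only an assumption about how a zero-weight guard might be written. It shows an int demand wrapping to a negative value, {{Math.log1p}} returning NaN for it, and the NaN passing a "non-positive" check before being narrowed to 0.

```java
// Illustrative sketch; NanWeightDemo and its methods are hypothetical names,
// not part of Hadoop.
final class NanWeightDemo {

    // Same expression as the weight formula quoted in the comment,
    // applied to a plain int memory demand (in MB).
    static double weightFor(int demandMemoryMb) {
        return Math.log1p(demandMemoryMb) / Math.log(2);
    }

    // Multiplying in int arithmetic wraps around silently on overflow.
    static int totalDemandMb(int containerMb, int containers) {
        return containerMb * containers;
    }

    public static void main(String[] args) {
        // 2^20 MB per container times 2^11 containers is 2^31, which
        // wraps to Integer.MIN_VALUE.
        int demand = totalDemandMb(1 << 20, 1 << 11);
        double weight = weightFor(demand);

        System.out.println("demand = " + demand);
        System.out.println("weight is NaN: " + Double.isNaN(weight));
        // Every ordered comparison with NaN is false, so a "weight <= 0"
        // guard (assumed form) does not reject it.
        System.out.println("weight <= 0: " + (weight <= 0));
        // Narrowing NaN to int yields 0 (Java Language Specification 5.1.3).
        System.out.println("(int) (weight * 1024): " + (int) (weight * 1024));
    }
}
```

A guard written as {{!(weight > 0)}} would reject NaN as well as zero and negative weights; whether the addendum patch takes that route is not stated in this thread.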
[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
zhihai xu created YARN-4979:
----------------------------

Summary: FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
Key: YARN-4979
URL: https://issues.apache.org/jira/browse/YARN-4979
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Affects Versions: 2.7.2, 2.8.0
Reporter: zhihai xu
Assignee: zhihai xu

FSAppAttempt adds duplicate ResourceRequests to demand in updateDemand. We should count only the ResourceRequest for ResourceRequest.ANY when calculating demand, because {{hasContainerForNode}} returns false if there is no container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} and {{allocateRackLocal}} also decrease the number of containers for ResourceRequest.ANY.
This issue may cause the current memory demand to overflow (integer), because duplicate requests can be on multiple nodes.
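The double counting described in YARN-4979 can be illustrated with a small sketch. Everything here is a hypothetical stand-in: {{DemandSketch}} and {{Request}} are not the FSAppAttempt or ResourceRequest classes, and the sample numbers are made up. The only fact taken from YARN is that ResourceRequest.ANY is the wildcard name "*". One wish for three 1024 MB containers is expressed at node, rack, and ANY level; summing every level inflates the demand, while counting only ANY gives the real figure.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; DemandSketch and Request are hypothetical stand-ins,
// not the FSAppAttempt or ResourceRequest classes.
final class DemandSketch {

    // ResourceRequest.ANY is the wildcard resource name "*".
    static final String ANY = "*";

    static final class Request {
        final String resourceName;
        final int memoryMb;
        final int numContainers;

        Request(String resourceName, int memoryMb, int numContainers) {
            this.resourceName = resourceName;
            this.memoryMb = memoryMb;
            this.numContainers = numContainers;
        }
    }

    // Sums every request, so the same containers are counted once per
    // locality level (node, rack, ANY).
    static long demandAllLevels(List<Request> requests) {
        long total = 0;
        for (Request r : requests) {
            total += (long) r.memoryMb * r.numContainers;
        }
        return total;
    }

    // Counts only the ANY-level request, which already covers the
    // node-local and rack-local ones.
    static long demandAnyOnly(List<Request> requests) {
        long total = 0;
        for (Request r : requests) {
            if (ANY.equals(r.resourceName)) {
                total += (long) r.memoryMb * r.numContainers;
            }
        }
        return total;
    }

    // Made-up sample: three 1024 MB containers wanted on either of two
    // hosts in one rack.
    static List<Request> sampleRequests() {
        return Arrays.asList(
                new Request("host1", 1024, 3),
                new Request("host2", 1024, 3),
                new Request("/rack1", 1024, 3),
                new Request(ANY, 1024, 3));
    }

    public static void main(String[] args) {
        List<Request> requests = sampleRequests();
        System.out.println("all levels: " + demandAllLevels(requests) + " MB");
        System.out.println("ANY only:   " + demandAnyOnly(requests) + " MB");
    }
}
```

The sketch accumulates in a long so that the inflated sum stays visible; with an int accumulator and enough nodes, the inflated sum is what could wrap negative.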
[jira] [Updated] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
[ https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4979:
----------------------------
    Attachment: YARN-4979.001.patch
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251356#comment-15251356 ]

zhihai xu commented on YARN-1458:
---------------------------------

I think FSAppAttempt may add duplicate ResourceRequests to demand, which may cause the current memory demand to overflow an integer. I created YARN-4979 to fix the wrong demand calculation in FSAppAttempt. YARN-4979 may be the root cause.
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252087#comment-15252087 ]

zhihai xu commented on YARN-1458:
---------------------------------

OK, no problem; you can try it at your convenience. Thanks for finding this issue!
[jira] [Commented] (YARN-4979) FSAppAttempt demand calculation considers demands at multiple locality levels different
[ https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297200#comment-15297200 ]

zhihai xu commented on YARN-4979:
---------------------------------

Thanks [~kasha] for reviewing and committing the patch!

> FSAppAttempt demand calculation considers demands at multiple locality levels different
> ---------------------------------------------------------------------------------------
>
> Key: YARN-4979
> URL: https://issues.apache.org/jira/browse/YARN-4979
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.0, 2.7.2
> Reporter: zhihai xu
> Assignee: zhihai xu
> Fix For: 2.9.0
>
> Attachments: YARN-4979.001.patch

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3446:
----------------------------
    Attachment: YARN-3446.002.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> --------------------------------------------------------------------------
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch,
> YARN-3446.002.patch
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headRoom used in the
> reducer preemption calculation includes blacklisted nodes. This makes jobs
> hang forever (the ResourceManager does not assign any new containers on
> blacklisted nodes, but the availableResource that the AM gets from the RM
> includes the blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.
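The idea behind YARN-3446 can be sketched as follows. This is a deliberately simplified model, not the patch: {{HeadroomSketch}}, its method, and the sample cluster are hypothetical, and a real headroom calculation also has to respect queue fair shares and limits. It only shows why an AM that blacklists a node should not be told that node's free memory is available to it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch; HeadroomSketch is a hypothetical name and this is a
// simplified model, not the FairScheduler headroom code.
final class HeadroomSketch {

    // Sums the free memory of every node the application can still use,
    // skipping nodes the application has blacklisted.
    static long headroomMb(Map<String, Long> availableMbByNode,
                           Set<String> blacklistedNodes) {
        long total = 0;
        for (Map.Entry<String, Long> e : availableMbByNode.entrySet()) {
            if (!blacklistedNodes.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    // Made-up three-node cluster with free memory in MB.
    static Map<String, Long> sampleCluster() {
        Map<String, Long> cluster = new LinkedHashMap<>();
        cluster.put("node1", 4096L);
        cluster.put("node2", 2048L);
        cluster.put("node3", 1024L);
        return cluster;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<>(Arrays.asList("node1"));
        System.out.println("ignoring blacklist: "
                + headroomMb(sampleCluster(), new HashSet<String>()) + " MB");
        System.out.println("excluding blacklist: "
                + headroomMb(sampleCluster(), blacklist) + " MB");
    }
}
```

With the blacklist ignored, the AM sees enough headroom that it never decides to preempt reducers, even though none of that memory on the blacklisted node will ever be assigned to it.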
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746455#comment-14746455 ]

zhihai xu commented on YARN-3446:
---------------------------------

Thanks [~kasha] for the reminder! I just uploaded a new patch, YARN-3446.002.patch, based on the latest code on trunk. Please review it.
[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805084#comment-14805084 ]

zhihai xu commented on YARN-3697:
---------------------------------

I also cherry-picked the change from trunk to branch-2.7, since YARN-4153 was committed to branch-2.7. TestAsyncDispatcher succeeds now.

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --------------------------------------------------------------
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
> FairScheduler: ContinuousSchedulingThread sometimes can't be shut down after
> stop.
> The reason is that the InterruptedException is swallowed in
> continuousSchedulingAttempt:
> {code}
>   try {
>     if (node != null && Resources.fitsIn(minimumAllocation,
>         node.getAvailableResource())) {
>       attemptScheduling(node);
>     }
>   } catch (Throwable ex) {
>     LOG.error("Error while attempting scheduling for node " + node +
>         ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 available= used=: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transit
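The shutdown problem quoted for YARN-3697 comes from a catch-all handler that logs an interrupt and keeps looping. The sketch below shows one general way such a loop can stay interruptible; it is not the committed patch, and {{InterruptAwareLoop}}, {{attempt}}, and {{run}} are hypothetical names. The stand-in attempt wraps the InterruptedException in a RuntimeException, mimicking the wrapped exception seen in the log, and the handler checks both the exception and its cause before deciding to log and continue.

```java
// Illustrative sketch; InterruptAwareLoop is a hypothetical name and this is
// not the change committed for YARN-3697.
final class InterruptAwareLoop {

    // Stand-in for one scheduling attempt. When failWithInterrupt is true it
    // throws an InterruptedException wrapped in a RuntimeException, similar
    // to the wrapped exception in the log above.
    static void attempt(boolean failWithInterrupt) {
        if (failWithInterrupt) {
            throw new RuntimeException(new InterruptedException("stop requested"));
        }
    }

    // Runs up to maxAttempts attempts and returns how many completed.
    // A negative interruptAt means no attempt is interrupted.
    static int run(int maxAttempts, int interruptAt) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                attempt(i == interruptAt);
            } catch (Throwable ex) {
                if (ex instanceof InterruptedException
                        || ex.getCause() instanceof InterruptedException) {
                    // Restore the interrupt status and leave the loop
                    // instead of logging and carrying on.
                    Thread.currentThread().interrupt();
                    return i;
                }
                System.err.println("Error while attempting scheduling: " + ex);
            }
        }
        return maxAttempts;
    }

    public static void main(String[] args) {
        int completed = run(10, 3);
        // Thread.interrupted() reports the restored status and clears it.
        boolean interrupted = Thread.interrupted();
        System.out.println("completed attempts: " + completed);
        System.out.println("interrupt status restored: " + interrupted);
    }
}
```

Restoring the interrupt status matters because the surrounding thread's own loop condition or sleep typically relies on it to notice that a stop was requested.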
[jira] [Created] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
zhihai xu created YARN-4187:
----------------------------

Summary: Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
Key: YARN-4187
URL: https://issues.apache.org/jira/browse/YARN-4187
Project: Hadoop YARN
Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu

The YARN client uses the local address instead of the RM address as the token renewer in a secure cluster when HA is enabled. This causes HDFS token renewal to fail for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in the HDFS DelegationTokenIdentifier.
The following exception causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
    at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
----------------------------
    Component/s: security
                 resourcemanager
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Attachment: YARN-4187.000.patch

> Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
> --
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, security
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
> Yarn Client uses the local address instead of the RM address as token renewer in a secure cluster when HA is enabled. This will cause an HDFS token renewal failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS DelegationTokenIdentifier.
> The following is the exception that causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
>     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
>     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>     at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
>     at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
>     at sun.reflect.NativeMe
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Attachment: (was: YARN-4187.000.patch)
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Attachment: YARN-4187.000.patch
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Description:
Yarn Client uses the local address instead of the RM address as token renewer in a secure cluster when HA is enabled. This will cause an HDFS token renewal failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS DelegationTokenIdentifier.
The reason why the local address is returned is when HA is enabled
The following is the exception that causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
    at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Summary: Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled. (was: Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.)

> Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.
> -
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, security
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
> Yarn Client uses the local address instead of the RM address as token renewer in a secure cluster when RM HA is enabled. This will cause an HDFS token renewal failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: when HA is enabled, "yarn.resourcemanager.address" may not be set, so the default address "0.0.0.0:8032" will be used. Based on the following code in KerberosUtil, the local address will be used to replace "0.0.0.0".
> {code}
>   public static final String getServicePrincipal(String service, String hostname)
>       throws UnknownHostException {
>     String fqdn = hostname;
>     if (null == fqdn || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
>       fqdn = getLocalHostName();
>     }
>     // convert hostname to lowercase as kerberos does not work with hostnames
>     // with uppercase characters.
>     return service + "/" + fqdn.toLowerCase(Locale.US);
>   }
>   /* Return fqdn of the current host */
>   static String getLocalHostName() throws UnknownHostException {
>     return InetAddress.getLocalHost().getCanonicalHostName();
>   }
> {code}
> The following is the exception that causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
>     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
>     at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
>     at org.apache.hadoop.ipc.RPC$Ser
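The fallback described in this message can be reproduced in isolation. The following is a minimal sketch, not the Hadoop source: the class `RenewerFallbackSketch`, the method `servicePrincipal`, and the explicit `localHostName` parameter are hypothetical, and the parameter stands in for `InetAddress.getLocalHost().getCanonicalHostName()` so the result does not depend on the machine running it. The host names are illustrative only.

```java
import java.util.Locale;

public class RenewerFallbackSketch {

  // Mirrors the branch quoted from KerberosUtil: a null, empty, or wildcard
  // host is replaced by the local host name. localHostName is a hypothetical
  // stand-in for InetAddress.getLocalHost().getCanonicalHostName().
  static String servicePrincipal(String service, String hostname, String localHostName) {
    String fqdn = hostname;
    if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
      fqdn = localHostName;
    }
    // Kerberos host names are compared in lowercase.
    return service + "/" + fqdn.toLowerCase(Locale.US);
  }

  public static void main(String[] args) {
    // RM address left at its default "0.0.0.0:8032": only the host part reaches the check,
    // so the client's own host name ends up in the principal.
    String defaultHost = "0.0.0.0:8032".split(":")[0];
    System.out.println(servicePrincipal("yarn", defaultHost, "client01.example.com"));

    // RM address set explicitly: the principal names the RM host.
    System.out.println(servicePrincipal("yarn", "RM1.example.com", "client01.example.com"));
  }
}
```

With the wildcard default, the principal is built from the submitting client's host rather than the RM's, which is the mismatch the issue summary describes.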
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Description:
Yarn Client uses the local address instead of the RM address as token renewer in a secure cluster when RM HA is enabled. This will cause an HDFS token renewal failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, "yarn.resourcemanager.address" may not be set, so the default address "0.0.0.0:8032" will be used. Based on the following code in KerberosUtil.java, the local address will be used to replace "0.0.0.0".
{code}
  public static final String getServicePrincipal(String service, String hostname)
      throws UnknownHostException {
    String fqdn = hostname;
    if (null == fqdn || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
      fqdn = getLocalHostName();
    }
    // convert hostname to lowercase as kerberos does not work with hostnames
    // with uppercase characters.
    return service + "/" + fqdn.toLowerCase(Locale.US);
  }
  /* Return fqdn of the current host */
  static String getLocalHostName() throws UnknownHostException {
    return InetAddress.getLocalHost().getCanonicalHostName();
  }
{code}
The following is the exception that causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapr
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Description: Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled. This will cause HDFS token renew failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS {{DelegationTokenIdentifier}}. The reason why the local address is returned is: When HA is enabled, "yarn.resourcemanager.address" may not be set, so the default address "0.0.0.0:8032" will be used, Based on the following code at SecurityUtil.java, the local address will be used to replace "0.0.0.0". {code} private static String replacePattern(String[] components, String hostname) throws IOException { String fqdn = hostname; if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { fqdn = getLocalHostName(); } return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + components[2]; } static String getLocalHostName() throws UnknownHostException { return InetAddress.getLocalHost().getCanonicalHostName(); } public static String getServerPrincipal(String principalConfig, InetAddress addr) throws IOException { String[] components = getComponents(principalConfig); if (components == null || components.length != 3 || !components[1].equals(HOSTNAME_PATTERN)) { return principalConfig; } else { if (addr == null) { throw new IOException("Can't replace " + HOSTNAME_PATTERN + " pattern since client address is null"); } return replacePattern(components, addr.getCanonicalHostName()); } } {code} The following is the exception which cause the job fail: {code} 15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300) at org.apache
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Description: Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled. This will cause HDFS token renew failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS {{DelegationTokenIdentifier}}. The reason why the local address is returned is: When HA is enabled, "yarn.resourcemanager.address" may not be set and {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal",so the default address "0.0.0.0:8032" will be used, Based on the following code at SecurityUtil.java, the local address will be used to replace "0.0.0.0". {code} private static String replacePattern(String[] components, String hostname) throws IOException { String fqdn = hostname; if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { fqdn = getLocalHostName(); } return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + components[2]; } static String getLocalHostName() throws UnknownHostException { return InetAddress.getLocalHost().getCanonicalHostName(); } public static String getServerPrincipal(String principalConfig, InetAddress addr) throws IOException { String[] components = getComponents(principalConfig); if (components == null || components.length != 3 || !components[1].equals(HOSTNAME_PATTERN)) { return principalConfig; } else { if (addr == null) { throw new IOException("Can't replace " + HOSTNAME_PATTERN + " pattern since client address is null"); } return replacePattern(components, addr.getCanonicalHostName()); } } {code} The following is the exception which cause the job fail: {code} 15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at or
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Description: Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled. This will cause HDFS token renew failure for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in HDFS {{DelegationTokenIdentifier}}. The reason why the local address is returned is: When HA is enabled, "yarn.resourcemanager.address" may not be set and {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal",so the default address "0.0.0.0:8032" will be used, Based on the following code at SecurityUtil.java, the local address will be used to replace "0.0.0.0". {code} private static String replacePattern(String[] components, String hostname) throws IOException { String fqdn = hostname; if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { fqdn = getLocalHostName(); } return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + components[2]; } static String getLocalHostName() throws UnknownHostException { return InetAddress.getLocalHost().getCanonicalHostName(); } public static String getServerPrincipal(String principalConfig, InetAddress addr) throws IOException { String[] components = getComponents(principalConfig); if (components == null || components.length != 3 || !components[1].equals(HOSTNAME_PATTERN)) { return principalConfig; } else { if (addr == null) { throw new IOException("Can't replace " + HOSTNAME_PATTERN + " pattern since client address is null"); } return replacePattern(components, addr.getCanonicalHostName()); } } {code} The following is the exception which cause the job fail: {code} 15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Summary: Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled. (was: Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.) > Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. > > > Key: YARN-4187 > URL: https://issues.apache.org/jira/browse/YARN-4187 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4187.000.patch > > > Yarn Client uses local address instead RM address as token renewer in a > secure cluster when RM HA is enabled. This will cause HDFS token renew > failure for renewer "nobody" if the rules from > {{hadoop.security.auth_to_local}} exclude the client address in HDFS > {{DelegationTokenIdentifier}}. > The reason why the local address is returned is: When HA is enabled, > "yarn.resourcemanager.address" may not be set and > {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal",so > the default address "0.0.0.0:8032" will be used, Based on the following code > at SecurityUtil.java, the local address will be used to replace "0.0.0.0". 
> {code} > private static String replacePattern(String[] components, String hostname) > throws IOException { > String fqdn = hostname; > if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { > fqdn = getLocalHostName(); > } > return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + > components[2]; > } > static String getLocalHostName() throws UnknownHostException { > return InetAddress.getLocalHost().getCanonicalHostName(); > } > public static String getServerPrincipal(String principalConfig, > InetAddress addr) throws IOException { > String[] components = getComponents(principalConfig); > if (components == null || components.length != 3 > || !components[1].equals(HOSTNAME_PATTERN)) { > return principalConfig; > } else { > if (addr == null) { > throw new IOException("Can't replace " + HOSTNAME_PATTERN > + " pattern since client address is null"); > } > return replacePattern(components, addr.getCanonicalHostName()); > } > } > {code} > The following is the exception which cause the job fail: > {code} > 15/09/12 16:27:24 WARN security.UserGroupInformation: > PriviledgedActionException as:t...@example.com (auth:KERBEROS) > cause:java.io.IOException: Failed to run job : yarn tries to renew a token > with renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > java.io.IOException: Failed to run job : yarn tries to renew a token with > renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientPro
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Summary: Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled. (was: Yarn Client uses local client address instead of RM address as token renewer in a secure cluster when RM HA is enabled.) > Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. > > > Key: YARN-4187 > URL: https://issues.apache.org/jira/browse/YARN-4187 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4187.000.patch > > > Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. This will cause HDFS token renew > failure for renewer "nobody" if the rules from > {{hadoop.security.auth_to_local}} exclude the client address in HDFS > {{DelegationTokenIdentifier}}. > The reason why the local address is returned is: When HA is enabled, > "yarn.resourcemanager.address" may not be set and > {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal",so > the default address "0.0.0.0:8032" will be used, Based on the following code > at SecurityUtil.java, the local address will be used to replace "0.0.0.0". 
> {code} > private static String replacePattern(String[] components, String hostname) > throws IOException { > String fqdn = hostname; > if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { > fqdn = getLocalHostName(); > } > return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + > components[2]; > } > static String getLocalHostName() throws UnknownHostException { > return InetAddress.getLocalHost().getCanonicalHostName(); > } > public static String getServerPrincipal(String principalConfig, > InetAddress addr) throws IOException { > String[] components = getComponents(principalConfig); > if (components == null || components.length != 3 > || !components[1].equals(HOSTNAME_PATTERN)) { > return principalConfig; > } else { > if (addr == null) { > throw new IOException("Can't replace " + HOSTNAME_PATTERN > + " pattern since client address is null"); > } > return replacePattern(components, addr.getCanonicalHostName()); > } > } > {code} > The following is the exception which cause the job fail: > {code} > 15/09/12 16:27:24 WARN security.UserGroupInformation: > PriviledgedActionException as:t...@example.com (auth:KERBEROS) > cause:java.io.IOException: Failed to run job : yarn tries to renew a token > with renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > java.io.IOException: Failed to run job : yarn tries to renew a token with > renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderP
[jira] [Updated] (YARN-4187) Yarn Client uses local client address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4187: Summary: Yarn Client uses local client address instead of RM address as token renewer in a secure cluster when RM HA is enabled. (was: Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.) > Yarn Client uses local client address instead of RM address as token renewer > in a secure cluster when RM HA is enabled. > --- > > Key: YARN-4187 > URL: https://issues.apache.org/jira/browse/YARN-4187 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4187.000.patch > > > Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. This will cause HDFS token renew > failure for renewer "nobody" if the rules from > {{hadoop.security.auth_to_local}} exclude the client address in HDFS > {{DelegationTokenIdentifier}}. > The reason why the local address is returned is: When HA is enabled, > "yarn.resourcemanager.address" may not be set and > {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal",so > the default address "0.0.0.0:8032" will be used, Based on the following code > at SecurityUtil.java, the local address will be used to replace "0.0.0.0". 
> {code} > private static String replacePattern(String[] components, String hostname) > throws IOException { > String fqdn = hostname; > if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { > fqdn = getLocalHostName(); > } > return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + > components[2]; > } > static String getLocalHostName() throws UnknownHostException { > return InetAddress.getLocalHost().getCanonicalHostName(); > } > public static String getServerPrincipal(String principalConfig, > InetAddress addr) throws IOException { > String[] components = getComponents(principalConfig); > if (components == null || components.length != 3 > || !components[1].equals(HOSTNAME_PATTERN)) { > return principalConfig; > } else { > if (addr == null) { > throw new IOException("Can't replace " + HOSTNAME_PATTERN > + " pattern since client address is null"); > } > return replacePattern(components, addr.getCanonicalHostName()); > } > } > {code} > The following is the exception which cause the job fail: > {code} > 15/09/12 16:27:24 WARN security.UserGroupInformation: > PriviledgedActionException as:t...@example.com (auth:KERBEROS) > cause:java.io.IOException: Failed to run job : yarn tries to renew a token > with renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > java.io.IOException: Failed to run job : yarn tries to renew a token with > renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.Authoriz
[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-4187:
    Description:
The Yarn client uses the local address instead of the RM address as the token renewer in a secure cluster when RM HA is enabled. This causes HDFS token renewal to fail for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in the HDFS {{DelegationTokenIdentifier}}.
The reason the local address is returned: when HA is enabled, "yarn.resourcemanager.address" may not be set. If {{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", the default address "0.0.0.0:8032" will be used. Based on the following code in SecurityUtil.java, the local address will then be used to replace "0.0.0.0".
{code}
  private static String replacePattern(String[] components, String hostname)
      throws IOException {
    String fqdn = hostname;
    if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
      fqdn = getLocalHostName();
    }
    return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + components[2];
  }

  static String getLocalHostName() throws UnknownHostException {
    return InetAddress.getLocalHost().getCanonicalHostName();
  }

  public static String getServerPrincipal(String principalConfig,
      InetAddress addr) throws IOException {
    String[] components = getComponents(principalConfig);
    if (components == null || components.length != 3
        || !components[1].equals(HOSTNAME_PATTERN)) {
      return principalConfig;
    } else {
      if (addr == null) {
        throw new IOException("Can't replace " + HOSTNAME_PATTERN
            + " pattern since client address is null");
      }
      return replacePattern(components, addr.getCanonicalHostName());
    }
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at o
[jira] [Commented] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876724#comment-14876724 ]

zhihai xu commented on YARN-4187:
-
I attached a patch, YARN-4187.000.patch, which uses YarnConfiguration instead of JobConf (Configuration) to call {{getSocketAddr}}, because Configuration.getSocketAddr does not support RM HA. We also need to check whether RM_HA_ID is configured: if it is not configured, the first ID in RM_HA_IDS will be used; otherwise it will fall back to using the client's local address.

> Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. > > > Key: YARN-4187 > URL: https://issues.apache.org/jira/browse/YARN-4187 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4187.000.patch > > > Yarn Client uses local address instead of RM address as token renewer in a > secure cluster when RM HA is enabled. This will cause HDFS token renew > failure for renewer "nobody" if the rules from > {{hadoop.security.auth_to_local}} exclude the client address in HDFS > {{DelegationTokenIdentifier}}. > The reason why the local address is returned is: When HA is enabled, > "yarn.resourcemanager.address" may not be set, if > {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal", > the default address "0.0.0.0:8032" will be used, Based on the following code > at SecurityUtil.java, the local address will be used to replace "0.0.0.0". 
> {code} > private static String replacePattern(String[] components, String hostname) > throws IOException { > String fqdn = hostname; > if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) { > fqdn = getLocalHostName(); > } > return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + > components[2]; > } > static String getLocalHostName() throws UnknownHostException { > return InetAddress.getLocalHost().getCanonicalHostName(); > } > public static String getServerPrincipal(String principalConfig, > InetAddress addr) throws IOException { > String[] components = getComponents(principalConfig); > if (components == null || components.length != 3 > || !components[1].equals(HOSTNAME_PATTERN)) { > return principalConfig; > } else { > if (addr == null) { > throw new IOException("Can't replace " + HOSTNAME_PATTERN > + " pattern since client address is null"); > } > return replacePattern(components, addr.getCanonicalHostName()); > } > } > {code} > The following is the exception which cause the job fail: > {code} > 15/09/12 16:27:24 WARN security.UserGroupInformation: > PriviledgedActionException as:t...@example.com (auth:KERBEROS) > cause:java.io.IOException: Failed to run job : yarn tries to renew a token > with renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > java.io.IOException: Failed to run job : yarn tries to renew a token with > renewer nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109) > at > or
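
The substitution described in the quoted description can be illustrated with a small standalone sketch. It is not the Hadoop SecurityUtil class: the class name PrincipalHostSketch, the method replaceHost, and the localHostName parameter (passed in so the example does not depend on DNS) are all hypothetical. It only mirrors the quoted logic, in which an unset address of "0.0.0.0" is replaced by the caller's own host name.

```java
import java.util.Locale;

// Hypothetical stand-in for the substitution logic quoted in the description.
// It is NOT org.apache.hadoop.security.SecurityUtil.
final class PrincipalHostSketch {

  // localHostName is an assumption of this sketch: the real code resolves it
  // through InetAddress, here it is injected to keep the example deterministic.
  static String replaceHost(String principal, String hostname, String localHostName) {
    // Split "service/_HOST@REALM" into its three components.
    String[] parts = principal.split("[/@]");
    if (parts.length != 3 || !parts[1].equals("_HOST")) {
      return principal; // nothing to substitute
    }
    String fqdn = hostname;
    if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
      // This is the branch that turns the client's own host into the renewer.
      fqdn = localHostName;
    }
    return parts[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + parts[2];
  }

  public static void main(String[] args) {
    String principal = "yarn/_HOST@EXAMPLE.COM";
    // RM address resolved correctly: the renewer names the RM host.
    System.out.println(replaceHost(principal, "RM1.example.com", "client1.example.com"));
    // RM address left at the 0.0.0.0 default: the renewer names the client host.
    System.out.println(replaceHost(principal, "0.0.0.0", "client1.example.com"));
  }
}
```

With the default address, the resulting principal names the client machine, which is why the renewer can end up unmapped on the HDFS side.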
[jira] [Created] (YARN-4190) Add container information in FairScheduler preemption log to help debug.
zhihai xu created YARN-4190:
---
             Summary: Add container information in FairScheduler preemption log to help debug.
                 Key: YARN-4190
                 URL: https://issues.apache.org/jira/browse/YARN-4190
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: fairscheduler
    Affects Versions: 2.7.1
            Reporter: zhihai xu
            Assignee: zhihai xu
            Priority: Trivial

Add container information to the FairScheduler preemption log to help debugging.
Currently the following log statement does not include container information:
{code}
LOG.info("Preempting container (prio=" + container.getContainer().getPriority() +
    "res=" + container.getContainer().getResource() +
    ") from queue " + queue.getName());
{code}
This makes it very difficult to debug preemption-related issues in FairScheduler.
The container information is printed by the following code:
{code}
LOG.info("Killing container" + container +
    " (after waiting for premption for " +
    (getClock().getTime() - time) + "ms)");
{code}
However, because the first log has no container ID, we can't match these two logs to each other.
It would be very useful to add container information to the first log.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
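
As a rough illustration of the request above, the sketch below builds both messages so that they share a container ID and can be correlated when searching the logs. It is only a sketch: the class PreemptionLogSketch, its method names, and the plain String and int parameters are hypothetical and do not reflect the FairScheduler types or any committed patch.

```java
// Hypothetical message builders; the real FairScheduler code logs through
// LOG.info and uses RMContainer objects rather than plain strings.
final class PreemptionLogSketch {

  // First log: now carries the container ID in addition to priority and resource.
  static String preemptingMessage(String containerId, int priority,
      String resource, String queue) {
    return "Preempting container " + containerId
        + " (prio=" + priority
        + " res=" + resource
        + ") from queue " + queue;
  }

  // Second log: already carried the container, shown here for comparison.
  static String killingMessage(String containerId, long waitedMs) {
    return "Killing container " + containerId
        + " (after waiting for preemption for " + waitedMs + "ms)";
  }

  public static void main(String[] args) {
    String id = "container_1442000000000_0001_01_000002"; // made-up ID
    System.out.println(preemptingMessage(id, 20, "<memory:1024, vCores:1>", "root.default"));
    System.out.println(killingMessage(id, 15000L));
  }
}
```

Searching the log for the container ID then returns both the preemption warning and the later kill for the same container.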
[jira] [Updated] (YARN-4190) missing container information in FairScheduler preemption log.
[ https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4190: Summary: missing container information in FairScheduler preemption log. (was: Add container information in FairScheduler preemption log to help debug.) > missing container information in FairScheduler preemption log. > -- > > Key: YARN-4190 > URL: https://issues.apache.org/jira/browse/YARN-4190 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > > Add container information in FairScheduler preemption log to help debug. > Currently the following log doesn't have container information > {code} > LOG.info("Preempting container (prio=" + > container.getContainer().getPriority() + > "res=" + container.getContainer().getResource() + > ") from queue " + queue.getName()); > {code} > So it will be very difficult to debug preemption related issue for > FairScheduler. > Even the container information is printed in the following code > {code} > LOG.info("Killing container" + container + > " (after waiting for premption for " + > (getClock().getTime() - time) + "ms)"); > {code} > But we can't match these two logs based on the container ID. > It will be very useful to add container information in the first log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4190) missing container information in FairScheduler preemption log.
[ https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-4190. - Resolution: Later > missing container information in FairScheduler preemption log. > -- > > Key: YARN-4190 > URL: https://issues.apache.org/jira/browse/YARN-4190 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > > Add container information in FairScheduler preemption log to help debug. > Currently the following log doesn't have container information > {code} > LOG.info("Preempting container (prio=" + > container.getContainer().getPriority() + > "res=" + container.getContainer().getResource() + > ") from queue " + queue.getName()); > {code} > So it will be very difficult to debug preemption related issue for > FairScheduler. > Even the container information is printed in the following code > {code} > LOG.info("Killing container" + container + > " (after waiting for premption for " + > (getClock().getTime() - time) + "ms)"); > {code} > But we can't match these two logs based on the container ID. > It will be very useful to add container information in the first log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4190) missing container information in FairScheduler preemption log.
[ https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876772#comment-14876772 ]

zhihai xu commented on YARN-4190:
-
Thanks [~xinxianyin] for the information! Yes, good suggestion. I resolved this issue as "Later" and made it depend on YARN-4134.

> missing container information in FairScheduler preemption log. > -- > > Key: YARN-4190 > URL: https://issues.apache.org/jira/browse/YARN-4190 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > > Add container information in FairScheduler preemption log to help debug. > Currently the following log doesn't have container information > {code} > LOG.info("Preempting container (prio=" + > container.getContainer().getPriority() + > "res=" + container.getContainer().getResource() + > ") from queue " + queue.getName()); > {code} > So it will be very difficult to debug preemption related issue for > FairScheduler. > Even the container information is printed in the following code > {code} > LOG.info("Killing container" + container + > " (after waiting for premption for " + > (getClock().getTime() - time) + "ms)"); > {code} > But we can't match these two logs based on the container ID. > It will be very useful to add container information in the first log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.003.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900067#comment-14900067 ]

zhihai xu commented on YARN-3446:
-
Hi [~kasha], thanks for the review!
I attached a new patch, YARN-3446.003.patch, which addresses your first comment. I also added more test cases to verify {{getHeadroom}} when blacklisted nodes are removed and added.
About your second comment: IMHO, without the optimization the overhead would be very large for a large cluster. For example, with 2000 AMs running on a 5000-node cluster, each AM heartbeat would need to go through the list of 5000 nodes to find the blacklisted {{SchedulerNode}}s. With 2000 AMs, that is 10,000,000 loop iterations. Normally the number of blacklisted nodes for each application should be very small, so iterating over the blacklisted nodes should not be a performance issue. Also, an AM won't change its blacklisted nodes frequently.
About your third comment: currently the {{SchedulerNode}}s are stored in {{AbstractYarnScheduler#nodes}} keyed by {{NodeId}}, but {{AppSchedulingInfo}} stores the blacklisted nodes as {{String}} node names or rack names. I can't find an easy way to translate a node name or rack name to a {{NodeId}}. So it looks like we would need to iterate through {{AbstractYarnScheduler#nodes}} to find the blacklisted {{SchedulerNode}}s if we used {{AppSchedulingInfo#getBlacklist}}. For a 5000-node cluster, that means looping 5000 times, which is a big overhead. {{AbstractYarnScheduler#nodes}} is defined as follows:
{code}
protected Map<NodeId, N> nodes = new ConcurrentHashMap<NodeId, N>();
{code}

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist. 
> - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
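
The cost argument in the comment above can be sketched as follows. This is an illustration under stated assumptions, not the YARN-3446 patch: the class HeadroomSketch, the name-keyed map of available memory, and the headroom formula (cluster available minus blacklisted available) are hypothetical simplifications. In particular, the real scheduler keys its nodes by NodeId rather than by host name, which is exactly the difficulty the comment describes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simplification: nodes are identified by host name and carry
// only an available-memory figure in MB.
final class HeadroomSketch {

  // Scans every node in the cluster: cost grows with cluster size per call.
  static long blacklistedAvailableByScan(Map<String, Long> availableMbByNode,
      Set<String> blacklist) {
    long sum = 0;
    for (Map.Entry<String, Long> e : availableMbByNode.entrySet()) {
      if (blacklist.contains(e.getKey())) {
        sum += e.getValue();
      }
    }
    return sum;
  }

  // Iterates only the (normally small) blacklist: cost grows with blacklist size.
  static long blacklistedAvailableByLookup(Map<String, Long> availableMbByNode,
      Set<String> blacklist) {
    long sum = 0;
    for (String node : blacklist) {
      Long mb = availableMbByNode.get(node);
      if (mb != null) {
        sum += mb;
      }
    }
    return sum;
  }

  // Headroom reported to the AM excludes resources it can never be given.
  static long headroomMb(long clusterAvailableMb, long blacklistedAvailableMb) {
    return Math.max(0L, clusterAvailableMb - blacklistedAvailableMb);
  }

  public static void main(String[] args) {
    Map<String, Long> available = new HashMap<>();
    available.put("node1", 4096L);
    available.put("node2", 2048L);
    available.put("node3", 1024L);
    Set<String> blacklist = new HashSet<>();
    blacklist.add("node2");

    long excluded = blacklistedAvailableByLookup(available, blacklist);
    System.out.println(headroomMb(4096L + 2048L + 1024L, excluded)); // prints 5120
  }
}
```

Both methods return the same total; the difference is that the scan touches every node on every heartbeat, while the lookup touches only the blacklisted entries, which is the optimization the comment argues for.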
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: (was: YARN-3446.003.patch) > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.003.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4095: Attachment: YARN-4095.001.patch > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
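
The overhead described in the quoted description can be reproduced with a small model. The sketch below is not LocalDirAllocator: the class SharedContextSketch, its Context type, the reinitCount counter, and the configuration key strings are hypothetical. It only models a static map keyed by configuration name in which a context is reinitialized whenever the directory string differs from the one it saw last.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a per-configuration-name context cache.
final class SharedContextSketch {

  static final class Context {
    private String savedDirs = "";
    int reinitCount = 0;

    // Mirrors the quoted check: reinitialize only when the value changed.
    void confChanged(String newDirs) {
      if (!newDirs.equals(savedDirs)) {
        savedDirs = newDirs;
        reinitCount++;
      }
    }
  }

  // One context per configuration name, shared by every caller using that name.
  private static final Map<String, Context> CONTEXTS = new TreeMap<>();

  static synchronized Context contextFor(String configKey) {
    return CONTEXTS.computeIfAbsent(configKey, k -> new Context());
  }

  public static void main(String[] args) {
    // Two callers share one key but hold different directory lists.
    Context shared = contextFor("example.local-dirs");
    for (int i = 0; i < 3; i++) {
      shared.confChanged("/d1");      // caller that excluded a bad dir
      shared.confChanged("/d1,/d2");  // caller that kept the original list
    }
    System.out.println(shared.reinitCount); // prints 6

    // With separate keys, each context is initialized once.
    Context a = contextFor("example.good-local-dirs");
    Context b = contextFor("example.shuffle-local-dirs");
    for (int i = 0; i < 3; i++) {
      a.confChanged("/d1");
      b.confChanged("/d1,/d2");
    }
    System.out.println(a.reinitCount + b.reinitCount); // prints 2
  }
}
```

Giving each caller its own key removes the alternating reinitialization, which is the improvement the issue asks for.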
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900285#comment-14900285 ] zhihai xu commented on YARN-4095: - Hi [~Jason Lowe], Could you help review the patch? thanks > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. 
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900286#comment-14900286 ] zhihai xu commented on YARN-4095: - Hi [~jlowe], Could you help review the patch? thanks > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. 
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901099#comment-14901099 ] zhihai xu commented on YARN-4095: - The first patch put {{NM_GOOD_LOCAL_DIRS}} and {{NM_GOOD_LOG_DIRS}} in YarnConfiguration.java, the second patch moved them to LocalDirsHandlerService.java, since they are only used inside {{LocalDirsHandlerService}}. > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. 
> {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904722#comment-14904722 ] zhihai xu commented on YARN-4095: - Thanks [~jlowe] for reviewing and committing the patch! > Avoid sharing AllocatorPerContext object in LocalDirAllocator between > ShuffleHandler and LocalDirsHandlerService. > - > > Key: YARN-4095 > URL: https://issues.apache.org/jira/browse/YARN-4095 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.8.0 > > Attachments: YARN-4095.000.patch, YARN-4095.001.patch > > > Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share > {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration > {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static > TreeMap with configuration name as key > {code} > private static Map contexts = > new TreeMap(); > {code} > {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a > {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same > {{Configuration}} object, but they will use the same {{AllocatorPerContext}} > object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value > in its {{Configuration}} object to exclude full and bad local dirs, > {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its > {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} > is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, > {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value > is changed. This will cause some overhead. > {code} > String newLocalDirs = conf.get(contextCfgItemName); > if (!newLocalDirs.equals(savedLocalDirs)) { > {code} > So it will be a good improvement to not share the same > {{AllocatorPerContext}} instance between {{ShuffleHandler}} and > {{LocalDirsHandlerService}}. 
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: YARN-3943.001.patch > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909043#comment-14909043 ]

zhihai xu commented on YARN-3943:
-
I attached a new patch, YARN-3943.001.patch, for review.
The new patch keeps backward compatibility by using the old configuration "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" as the high-watermark threshold for disk-full detection, and by adding a new configuration "yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage" as the low-watermark threshold for disk-not-full detection.
Both configurations use the same default value. If the low-watermark threshold is greater than the high-watermark threshold, it is set to the same value as the high-watermark threshold.

> Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
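
The two-threshold behaviour described in the comment above is a form of hysteresis, and a minimal sketch of it follows. It is not the NodeManager disk health checker: the class DiskWatermarkSketch, its field names, and the update method are hypothetical. The only details taken from the comment are that a high watermark marks a disk full, a low watermark marks it good again, and a low watermark above the high one is clamped to the high value.

```java
// Hypothetical two-threshold (hysteresis) check for a single disk.
final class DiskWatermarkSketch {
  private final float highPercent; // crossing above this marks the disk full
  private final float lowPercent;  // dropping below this marks it good again
  private boolean full = false;

  DiskWatermarkSketch(float highPercent, float lowPercent) {
    this.highPercent = highPercent;
    // As in the comment: a low watermark above the high one is clamped.
    this.lowPercent = Math.min(lowPercent, highPercent);
  }

  // Feeds one utilization sample and returns whether the disk counts as full.
  boolean update(float usedPercent) {
    if (full) {
      if (usedPercent < lowPercent) {
        full = false;
      }
    } else if (usedPercent > highPercent) {
      full = true;
    }
    return full;
  }

  public static void main(String[] args) {
    DiskWatermarkSketch disk = new DiskWatermarkSketch(90.0f, 80.0f);
    float[] samples = {85f, 91f, 89f, 85f, 79f, 85f};
    for (float s : samples) {
      System.out.println(s + "% -> full=" + disk.update(s));
    }
  }
}
```

With a single threshold at 90%, the samples 91, 89, 91 would flip the disk state on every check; with the 90/80 pair the disk stays full until utilization drops below 80%.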
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: (was: YARN-3943.001.patch) > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: YARN-3943.001.patch > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4209) RMStateStore FENCED state doesn’t work
zhihai xu created YARN-4209: --- Summary: RMStateStore FENCED state doesn’t work Key: YARN-4209 URL: https://issues.apache.org/jira/browse/YARN-4209 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.1 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical RMStateStore FENCED state doesn’t work. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even notifyStoreOperationFailed is called. The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
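The nested-transition problem described in this report can be reproduced with a toy model. The sketch below is not Hadoop code: ToyStateStore and its methods are hypothetical stand-ins, and the only assumption carried over from the report is that the outer doTransition assigns its own post-state (ACTIVE) after its transition hook returns, overwriting whatever the inner doTransition set.

```java
import java.util.function.Supplier;

/** Toy stand-in for a state machine whose doTransition assigns the state after the hook returns. */
class ToyStateStore {
    enum State { ACTIVE, FENCED }

    private State state = State.ACTIVE;

    /** Runs the transition hook, then assigns the post-state the hook returned. */
    private void doTransition(Supplier<State> hook) {
        State next = hook.get(); // a nested doTransition may run inside the hook
        state = next;            // this assignment happens last, so the outer call wins
    }

    /** Models updateFencedState: a transition whose post-state is FENCED. */
    void updateFencedState() {
        doTransition(() -> State.FENCED);
    }

    /** Models a store operation that fails and tries to fence from inside its own transition. */
    void removeTokenWithFailure() {
        doTransition(() -> {
            updateFencedState();  // inner transition sets FENCED
            return State.ACTIVE;  // outer transition then sets its own post-state
        });
    }

    State getState() {
        return state;
    }
}
```

Calling removeTokenWithFailure() leaves the toy store in ACTIVE, mirroring the call chain in the report, while calling updateFencedState() on its own (the analogue of the VerifyActiveStatusThread case) leaves it in FENCED.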
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Attachment: YARN-4209.000.patch > RMStateStore FENCED state doesn’t work > -- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Affects Version/s: (was: 2.7.1) > RMStateStore FENCED state doesn’t work > -- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Affects Version/s: 2.7.2 > RMStateStore FENCED state doesn’t work > -- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910128#comment-14910128 ] zhihai xu commented on YARN-4209: - I attached a patch, YARN-4209.000.patch, which moves {{updateFencedState}} from {{notifyStoreOperationFailed}} to {{StandByTransitionThread}}, so {{updateFencedState}} is no longer called from within {{stateMachine.doTransition}}. > RMStateStore FENCED state doesn’t work > -- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Summary: RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition (was: RMStateStore FENCED state doesn’t work) > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Description: RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by {{stateMachine.doTransition}}. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even notifyStoreOperationFailed is called. The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. was: RMStateStore FENCED state doesn’t work. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even notifyStoreOperationFailed is called. The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. 
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even notifyStoreOperationFailed is called. The only working case for > FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Description: RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by {{stateMachine.doTransition}}. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even {{notifyStoreOperationFailed}} is called. The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. was: RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by {{stateMachine.doTransition}}. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even notifyStoreOperationFailed is called. 
The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even {{notifyStoreOperationFailed}} is called. The only working case > for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. 
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910141#comment-14910141 ] zhihai xu commented on YARN-4209: - Also the test case in the patch can verify this issue. Without the change, the RMStateStore is still in ACTIVE state even after {{updateFencedState}} is called. > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even {{notifyStoreOperationFailed}} is called. The only working case > for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Description: RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by {{stateMachine.doTransition}}. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even after {{notifyStoreOperationFailed}} is called. The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. was: RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by {{stateMachine.doTransition}}. The reason is {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in {{stateMachine.doTransition}} called from public API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED state, the external state transition changes the state back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even {{notifyStoreOperationFailed}} is called. 
The only working case for FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}. For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} change state to FENCED => exit external {{stateMachine.doTransition}} change state to ACTIVE. > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. 
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Attachment: (was: YARN-4209.000.patch) > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4209: Attachment: YARN-4209.000.patch > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-4209.000.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934261#comment-14934261 ]

zhihai xu commented on YARN-4209:

Hi [~jianhe],
Could you help review the patch? I added a lock and an {{isFencedState}} check in {{StandByTransitionThread}} to make sure that {{handleTransitionToStandBy}} and {{updateFencedState}} are each called only once, which avoids any potential race condition.
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934695#comment-14934695 ]

zhihai xu commented on YARN-4209:

Thanks for the review [~rohithsharma]! Yes, that is a good point! Using MultipleArcTransition will be a better solution. I will implement a new patch using MultipleArcTransition.
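To illustrate why a multi-arc transition helps here, the sketch below lets the transition hook itself return the next state, so a failure inside the hook can select FENCED without firing a second, nested transition. This is a toy model and not the YARN-4209 patch: `MultiArcDemo`, `removeToken` and the `storeFails` flag are hypothetical, and the only assumption carried over is that a multi-arc hook decides the post-state.

```java
import java.util.function.Supplier;

// Toy model of a multi-arc transition; not Hadoop's MultipleArcTransition interface.
class MultiArcDemo {
    enum State { ACTIVE, FENCED }

    private State state = State.ACTIVE;

    // The hook returns the post-state, so there is no fixed post-state to overwrite it.
    private void doTransition(Supplier<State> hook) {
        state = hook.get();
    }

    // Hypothetical store operation: a failed store write picks FENCED as the next state.
    void removeToken(boolean storeFails) {
        doTransition(() -> storeFails ? State.FENCED : State.ACTIVE);
    }

    State getState() {
        return state;
    }

    public static void main(String[] args) {
        MultiArcDemo store = new MultiArcDemo();
        store.removeToken(false);
        System.out.println(store.getState()); // ACTIVE
        store.removeToken(true);
        System.out.println(store.getState()); // FENCED
    }
}
```

Because the state is assigned exactly once per event, the fenced state survives the end of the transition that detected the failure.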
[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3727:
    Attachment: YARN-3727.001.patch

> For better error recovery, check if the directory exists before using it for localization.
> ---
>
> Key: YARN-3727
> URL: https://issues.apache.org/jira/browse/YARN-3727
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for localization.
> We saw the following localization failure, which was caused by existing cache directories.
> {code}
> 2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory //8/yarn/nm/usercache//filecache/21637
> 2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
> {code}
> The real cause of this failure may be a disk failure, a LevelDB operation failure in {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else.
> I wonder whether we can add error recovery code that avoids the localization failure by not using existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted.
> {code}
> try {
>   .
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
>     files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after the localization failure, I think it will be better to check whether the directory exists before using it for localization, so that the localization failure is avoided.
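The proposed check can be sketched roughly as follows, using plain `java.nio.file` instead of Hadoop's file APIs. This is not the YARN-3727 patch: `LocalCacheDirPicker`, `nextUnusedDir` and the idea of trying consecutive numeric directory names (suggested only by the `.../filecache/21637` path in the log above) are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: skip cache directories that already exist on disk.
class LocalCacheDirPicker {

    // Returns the first directory named startId, startId + 1, ... that does not exist.
    static Path nextUnusedDir(Path cacheRoot, long startId, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            Path candidate = cacheRoot.resolve(Long.toString(startId + i));
            // notExists is false when existence cannot be determined, so such paths are skipped too.
            if (Files.notExists(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException(
            "no unused directory under " + cacheRoot + " after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("filecache");
        Files.createDirectory(root.resolve("21637")); // simulate a leftover cache directory
        System.out.println(nextUnusedDir(root, 21637L, 5).getFileName()); // 21638
    }
}
```

Checking before use only narrows the window for a conflict; a directory could still appear between the check and the later rename, so the existing cleanup in the `catch` block would remain necessary.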
[jira] [Commented] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935634#comment-14935634 ]

zhihai xu commented on YARN-3727:

[~lichangleo], [~jlowe], thanks for the review! Yes, I uploaded a new patch YARN-3727.001.patch based on the latest code at trunk.
[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3727:
    Attachment: YARN-3727.001.patch
[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3727:
    Attachment: (was: YARN-3727.001.patch)