[jira] [Updated] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-14 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3535:

Fix Version/s: 2.6.4

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on that NM,
> and the job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057536#comment-15057536
 ] 

zhihai xu commented on YARN-3535:
-

Yes, this issue exists in 2.6.x; I just committed this patch to branch-2.6.

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on that NM,
> and the job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057608#comment-15057608
 ] 

zhihai xu commented on YARN-3535:
-

You are welcome! I think this will be a very critical fix for the 2.6.4 release.

> Scheduler must re-request container resources when RMContainer transitions 
> from ALLOCATED to KILLED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on that NM,
> and the job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057613#comment-15057613
 ] 

zhihai xu commented on YARN-4440:
-

Committed to trunk and branch-2. Thanks [~linyiqun] for the contribution!

> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, 
> YARN-4440.003.patch
>
>
> There seems to be a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} 
> method:
> {code}
> // default level is NODE_LOCAL
> if (! allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> On the first invocation, the method does not record a time in 
> lastScheduledContainer, which causes the following code to run on the next 
> invocation:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> Here waitTime is computed against the FsApp start time, which is far earlier than 
> currentTimeMs, so the wait easily exceeds the delay threshold and the allowed 
> locality degrades. We should record an initial time for the priority so the 
> comparison is not made against the FsApp start time and the allowed locality 
> level does not degrade. This problem hits small jobs hardest. 
> YARN-4399 also discusses some locality-related problems.
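
For illustration only, a minimal sketch of the direction the summary suggests (record a last-scheduled time the first time a priority is seen, so later wait-time checks do not fall back to the application start time). The {{clock}} field and map names below are assumptions, not the committed patch:

{code}
// Sketch: when a priority is seen for the first time, record "now" as its
// last-scheduled time in addition to defaulting the locality level.
if (!allowedLocalityLevel.containsKey(priority)) {
  lastScheduledContainer.put(priority, clock.getTime()); // assumed clock field
  allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
  return NodeType.NODE_LOCAL;
}
{code}

With something like this in place, the later {{waitTime -= lastScheduledContainer.get(priority)}} branch is taken on the next invocation instead of subtracting the FsApp start time.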



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-4458:
---

 Summary: Compilation error at branch-2.7 due to 
getNodeLabelExpression not defined in NMContainerStatusPBImpl.
 Key: YARN-4458
 URL: https://issues.apache.org/jira/browse/YARN-4458
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Release Note: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl.  (was: Compilation error at branch-2.7 
due to {{getNodeLabelExpression}} not defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: YARN-4458.000.patch

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: (was: YARN-4458.000.patch)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Attachment: YARN-4458.branch-2.7.patch

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Description: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl.

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Description: Compilation error at branch-2.7 due to getNodeLabelExpression 
not defined in NMContainerStatusPBImpl. This issue only happens for branch-2.7. 
 (was: Compilation error at branch-2.7 due to getNodeLabelExpression not 
defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4458:

Release Note:   (was: Compilation error at branch-2.7 due to 
getNodeLabelExpression not defined in NMContainerStatusPBImpl.)

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058256#comment-15058256
 ] 

zhihai xu commented on YARN-4458:
-

Thanks [~jlowe]! Yes, it makes sense; that will make cherry-picking easier.

> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl.
> -
>
> Key: YARN-4458
> URL: https://issues.apache.org/jira/browse/YARN-4458
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4458.branch-2.7.patch
>
>
> Compilation error at branch-2.7 due to getNodeLabelExpression not defined in 
> NMContainerStatusPBImpl. This issue only happens for branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-12-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3857:

Affects Version/s: 2.6.2

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0, 2.6.2
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
> We register the ClientTokenMasterKey so that a client does not hold an invalid 
> ClientToken after the RM restarts. In SIMPLE mode we register the Pair, but we 
> never remove it from the HashMap, because unregistration only runs in secure 
> mode, so memory leaks over time.
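
For illustration only, a minimal sketch of the kind of fix the description implies (always removing the entry when an attempt finishes, rather than only in secure mode); the class, field, and method names here are assumptions, not the committed patch:

{code}
// Sketch: unregister the attempt's ClientToken master key unconditionally,
// so the HashMap entry is removed in SIMPLE mode as well.
public synchronized void unregisterApplication(ApplicationAttemptId attemptId) {
  masterKeys.remove(attemptId); // previously only reached in secure mode
}
{code}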



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058263#comment-15058263
 ] 

zhihai xu commented on YARN-4439:
-

Hi [~jianhe], could you revert the old patch and create a new patch for 
branch-2.7 to fix the compilation error?

> Clarify NMContainerStatus#toString method.
> --
>
> Key: YARN-4439
> URL: https://issues.apache.org/jira/browse/YARN-4439
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.3
>
> Attachments: YARN-4439.1.patch, YARN-4439.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.

2015-12-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058460#comment-15058460
 ] 

zhihai xu commented on YARN-4439:
-

Good catch, [~jlowe]! Will clean it up. Thanks!

> Clarify NMContainerStatus#toString method.
> --
>
> Key: YARN-4439
> URL: https://issues.apache.org/jira/browse/YARN-4439
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.7.3
>
> Attachments: YARN-4439.1.patch, YARN-4439.2.patch, 
> YARN-4439.appendum-2.7.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time

2015-12-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065066#comment-15065066
 ] 

zhihai xu commented on YARN-4440:
-

Yes, thanks [~leftnoteasy] for committing it to branch-2.8!

> FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
> -
>
> Key: YARN-4440
> URL: https://issues.apache.org/jira/browse/YARN-4440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Fix For: 2.8.0
>
> Attachments: YARN-4440.001.patch, YARN-4440.002.patch, 
> YARN-4440.003.patch
>
>
> There seems to be a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} 
> method:
> {code}
> // default level is NODE_LOCAL
> if (! allowedLocalityLevel.containsKey(priority)) {
>   allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
>   return NodeType.NODE_LOCAL;
> }
> {code}
> On the first invocation, the method does not record a time in 
> lastScheduledContainer, which causes the following code to run on the next 
> invocation:
> {code}
> // check waiting time
> long waitTime = currentTimeMs;
> if (lastScheduledContainer.containsKey(priority)) {
>   waitTime -= lastScheduledContainer.get(priority);
> } else {
>   waitTime -= getStartTime();
> }
> {code}
> Here waitTime is computed against the FsApp start time, which is far earlier than 
> currentTimeMs, so the wait easily exceeds the delay threshold and the allowed 
> locality degrades. We should record an initial time for the priority so the 
> comparison is not made against the FsApp start time and the allowed locality 
> level does not degrade. This problem hits small jobs hardest. 
> YARN-4399 also discusses some locality-related problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-04 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.004.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.
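
For illustration only, a rough sketch of the idea (subtract the resources of nodes the application has blacklisted from the headroom it is shown); the field and helper names are assumptions, not the actual patch:

{code}
// Sketch: reduce the reported headroom by the available resources of
// blacklisted nodes so the AM's preemption math matches what it can really get.
Resource headroom = queue.getHeadroom();                // assumed existing value
for (FSSchedulerNode node : nodesBlacklistedByApp) {    // assumed collection
  headroom = Resources.subtract(headroom, node.getAvailableResource());
}
{code}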



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-04 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082601#comment-15082601
 ] 

zhihai xu commented on YARN-3446:
-

Thanks for the review! I just updated the patch in YARN-3446.004.patch.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2016-01-05 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3697:

Fix Version/s: 2.6.4

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes cannot be shut down 
> after stop.
> The reason is that the InterruptedException is swallowed inside 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContai
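
For illustration only, a minimal sketch of one way the swallowed interrupt in the snippet above could be surfaced so the thread can actually stop; this is a common pattern, not the committed YARN-3697 patch:

{code}
// Sketch: detect an interrupt wrapped by the catch (Throwable) block and
// restore the thread's interrupt status so the scheduling loop can exit.
try {
  if (node != null && Resources.fitsIn(minimumAllocation,
      node.getAvailableResource())) {
    attemptScheduling(node);
  }
} catch (Throwable ex) {
  LOG.error("Error while attempting scheduling for node " + node +
      ": " + ex.toString(), ex);
  if (ex instanceof InterruptedException ||
      ex.getCause() instanceof InterruptedException) {
    Thread.currentThread().interrupt(); // let the run() loop observe the stop
  }
}
{code}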

[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2016-01-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082641#comment-15082641
 ] 

zhihai xu commented on YARN-3697:
-

[~djp], yes, I just committed it to branch-2.6. Thanks!

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes cannot be shut down 
> after stop.
> The reason is that the InterruptedException is swallowed inside 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.re

[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-13 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097605#comment-15097605
 ] 

zhihai xu commented on YARN-3446:
-

Thanks for the review, [~kasha]! That is a good suggestion. I attached a new 
patch, YARN-3446.005.patch, which addresses your comments. Please review it.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-13 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.005.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2016-01-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098347#comment-15098347
 ] 

zhihai xu commented on YARN-3446:
-

The test failures in TestClientRMTokens and TestAMAuthorization are not related 
to the patch. Both tests pass in my local build.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler headroom calculation should exclude nodes in the blacklist

2016-01-14 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098617#comment-15098617
 ] 

zhihai xu commented on YARN-3446:
-

[~kasha], thanks for the review and committing the patch!

> FairScheduler headroom calculation should exclude nodes in the blacklist
> 
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.9.0
>
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, 
> YARN-3446.005.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the reducer 
> preemption calculation includes blacklisted nodes. This can make jobs hang 
> forever: the ResourceManager does not assign new containers on blacklisted 
> nodes, but the availableResource the AM gets from the RM still includes the 
> available resources of those blacklisted nodes.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118816#comment-15118816
 ] 

zhihai xu commented on YARN-4646:
-

Is this issue fixed by MAPREDUCE-6439? They have the same stack trace.

> AMRMClient crashed when RM transition from active to standby
> 
>
> Key: YARN-4646
> URL: https://issues.apache.org/jira/browse/YARN-4646
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>
> When the RM transitions to standby, ApplicationMasterService#allocate() is 
> interrupted and the exception is passed to the AM.
> The following is the exception message:
> {quote}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:258)
> ... 11 more
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> at com.sun.proxy.$Proxy35.allocate(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:274)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:237)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException):
>  java.lang.InterruptedException
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$

[jira] [Commented] (YARN-4502) Fix two AM containers get allocated when AM restart

2016-02-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129324#comment-15129324
 ] 

zhihai xu commented on YARN-4502:
-

+1 as well. This patch also covers the case where a container receives an 
RMContainerEventType.EXPIRE event in state RMContainerState.ALLOCATED, which 
was not covered by YARN-3535.
Based on the original suggestion by [~leftnoteasy], the implementation of 
{{AbstractYarnScheduler#getApplicationAttempt(ApplicationAttemptId 
applicationAttemptId)}} also looks confusing: it always returns the current 
application attempt, even when that attempt does not match the given 
{{applicationAttemptId}}.
In contrast, {{RMAppImpl#getRMAppAttempt(ApplicationAttemptId appAttemptId)}} 
always returns the matching {{RMAppAttempt}}.
Should we fix this in a follow-up JIRA? (A sketch of the stricter lookup is below.)
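
For illustration only, a sketch of the stricter lookup semantics described above (return an attempt only when its id matches, as {{RMAppImpl#getRMAppAttempt}} does); the types are simplified and the body is an assumption, not the Hadoop implementation:

{code}
// Sketch: only return the current attempt if it matches the requested id.
public SchedulerApplicationAttempt getApplicationAttempt(
    ApplicationAttemptId applicationAttemptId) {
  SchedulerApplication app =
      applications.get(applicationAttemptId.getApplicationId());
  SchedulerApplicationAttempt attempt =
      (app == null) ? null : app.getCurrentAppAttempt();
  if (attempt == null || !attempt.getApplicationAttemptId()
      .equals(applicationAttemptId)) {
    return null; // instead of silently returning a different attempt
  }
  return attempt;
}
{code}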

> Fix two AM containers get allocated when AM restart
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the second AM container was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched there; 
> that container stayed in the ACQUIRED state.
> Instead, container_e12_1450825622869_0001_02_02 ended up running the AM.
> Expected behavior: the RM should not start two containers to launch the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) gjfbndbfcjenrgccriejuvcnktllcc

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: gjfbndbfcjenrgccriejuvcnktllcc  (was: cfjgdgcejkrbvgluuehgnkj)

> gjfbndbfcjenrgccriejuvcnktllcc
> --
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the second AM container was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched there; 
> that container stayed in the ACQUIRED state.
> Instead, container_e12_1450825622869_0001_02_02 ended up running the AM.
> Expected behavior: the RM should not start two containers to launch the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) cfjgdgcejkrbvgluuehgnkj

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: cfjgdgcejkrbvgluuehgnkj  (was: Fix two AM containers get allocated 
when AM restart)

> cfjgdgcejkrbvgluuehgnkj
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the second AM container was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched there; 
> that container stayed in the ACQUIRED state.
> Instead, container_e12_1450825622869_0001_02_02 ended up running the AM.
> Expected behavior: the RM should not start two containers to launch the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4502) Fix two AM containers get allocated when AM restart

2016-02-02 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4502:

Summary: Fix two AM containers get allocated when AM restart  (was: 
gjfbndbfcjenrgccriejuvcnktllcc)

> Fix two AM containers get allocated when AM restart
> ---
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt
>
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
>  yarn  org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
> hadoop-yarn-applications-distributedshell-*.jar 
> -attempt_failures_validity_interval 6 -shell_command "sleep 150" 
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_02
> INFO impl.TimelineClientImpl: Timeline service address: 
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:
> Total number of containers :2
> Container-Id Start Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa
> container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 
>   N/A RUNNINGxxx:25454   http://xxx:8042 
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa
> {code}
> * look for new AM pid 
> Here, the second AM container was supposed to be started in 
> container_e12_1450825622869_0001_02_01, but the AM was not launched there; 
> that container stayed in the ACQUIRED state.
> Instead, container_e12_1450825622869_0001_02_02 ended up running the AM.
> Expected behavior: the RM should not start two containers to launch the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305
 ] 

zhihai xu commented on YARN-4728:
-

Thanks for reporting this issue, [~Silnov]!
It looks like this issue is caused by long timeouts at two levels. It is similar 
to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may be able to work around 
it by changing these configuration values: 
"ipc.client.connect.max.retries.on.timeouts" (default 45), 
"ipc.client.connect.timeout" (default 20,000ms) and 
"yarn.client.nodemanager-connect.max-wait-ms" (default 900,000ms).

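For illustration only, a hedged example of overriding those timeouts programmatically; the values below are arbitrary examples rather than recommendations, and in practice the keys would normally go into core-site.xml / yarn-site.xml:

{code}
// Sketch: shorten the client-side retry/timeout budget so a LOST NM is
// given up on sooner. Example values only.
Configuration conf = new YarnConfiguration();
conf.setInt("ipc.client.connect.max.retries.on.timeouts", 10);        // default 45
conf.setInt("ipc.client.connect.timeout", 10000);                     // default 20,000 ms
conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 120000L); // default 900,000 ms
{code}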
> MapReduce job doesn't make any progress for a very very long time after one 
> Node become unusable.
> -
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop 2.6.0
> yarn
>Reporter: Silnov
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running Hadoop 2.6.0.
> The cluster's configuration is largely left at the defaults.
> I run jobs on the cluster every day, including some that process a lot of data.
> Sometimes a job stays at the same progress value for a very, very long time, so 
> I have to kill the job manually and re-submit it to the cluster. This used to 
> work (the re-submitted job would run to the end), but something went wrong today.
> After I re-submitted the same job 3 times, it deadlocked each time (the progress 
> stopped changing for a long time, at a different value each run, e.g. 33.01%, 
> 45.8%, 73.21%).
> Checking the Hadoop web UI, I found 98 map tasks waiting while the running 
> reduce tasks had consumed all of the available memory. I stopped YARN, added the 
> configuration below to yarn-site.xml, and restarted YARN:
> yarn.app.mapreduce.am.job.reduce.rampup.limit = 0.1
> yarn.app.mapreduce.am.job.reduce.preemption.limit = 1.0
> (expecting YARN to preempt the reduce tasks' resources to run the waiting map 
> tasks)
> After restarting YARN, I submitted the job with the property 
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same thing happened again: the job stayed at the same progress value for 
> a very, very long time.
> I checked the web UI again and found that the waiting map task had been 
> re-created with the note: "TaskAttempt killed because it ran on 
> unusable node node02:21349".
> Then I checked the ResourceManager log and found these messages:
> **Deactivating Node node02:21349 as it is now LOST.
> **node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may have happened because the network across the cluster is not 
> good, so the RM did not receive the NM's heartbeat in time.
> But I wonder why the YARN framework cannot preempt the running reduce tasks' 
> resources to run the waiting map tasks (this leaves the job at the same progress 
> value for a very, very long time).
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-27 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170481#comment-15170481
 ] 

zhihai xu commented on YARN-4728:
-

Yes, MAPREDUCE-6513 is possible, but YARN-1680 is more likely, because 
blacklisted nodes arise more easily in your environment than the MAPREDUCE-6513 
condition, especially with mapreduce.job.reduce.slowstart.completedmaps=1. To 
tell whether it is MAPREDUCE-6513 or YARN-1680, check the log to see whether a 
reduce task is preempted. If a reduce task is preempted and the map task still 
cannot get resources, it is MAPREDUCE-6513/MAPREDUCE-6514; otherwise it is 
YARN-1680. Even with YARN-1680 fixed, which triggers the preemption, 
MAPREDUCE-6513 can still happen.

> MapReduce job doesn't make any progress for a very very long time after one 
> Node become unusable.
> -
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.6.0
> Environment: hadoop 2.6.0
> yarn
>Reporter: Silnov
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running Hadoop 2.6.0.
> The cluster's configuration is largely left at the defaults.
> I run jobs on the cluster every day, including some that process a lot of data.
> Sometimes a job stays at the same progress value for a very, very long time, so 
> I have to kill the job manually and re-submit it to the cluster. This used to 
> work (the re-submitted job would run to the end), but something went wrong today.
> After I re-submitted the same job 3 times, it deadlocked each time (the progress 
> stopped changing for a long time, at a different value each run, e.g. 33.01%, 
> 45.8%, 73.21%).
> Checking the Hadoop web UI, I found 98 map tasks waiting while the running 
> reduce tasks had consumed all of the available memory. I stopped YARN, added the 
> configuration below to yarn-site.xml, and restarted YARN:
> yarn.app.mapreduce.am.job.reduce.rampup.limit = 0.1
> yarn.app.mapreduce.am.job.reduce.preemption.limit = 1.0
> (expecting YARN to preempt the reduce tasks' resources to run the waiting map 
> tasks)
> After restarting YARN, I submitted the job with the property 
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same thing happened again: the job stayed at the same progress value for 
> a very, very long time.
> I checked the web UI again and found that the waiting map task had been 
> re-created with the note: "TaskAttempt killed because it ran on 
> unusable node node02:21349".
> Then I checked the ResourceManager log and found these messages:
> **Deactivating Node node02:21349 as it is now LOST.
> **node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may have happened because the network across the cluster is not 
> good, so the RM did not receive the NM's heartbeat in time.
> But I wonder why the YARN framework cannot preempt the running reduce tasks' 
> resources to run the waiting map tasks (this leaves the job at the same progress 
> value for a very, very long time).
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-03 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179066#comment-15179066
 ] 

zhihai xu commented on YARN-4761:
-

Good finding, [~sjlee0]! The same issue could also happen for the fair 
scheduler; we should decouple RMNode status from the fair scheduler as well.

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182514#comment-15182514
 ] 

zhihai xu commented on YARN-4761:
-

+1 for the latest patch. The test failures are not related to the patch, and 
one of them is the same as YARN-4306. Will commit the patch shortly.

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler

2016-03-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182647#comment-15182647
 ] 

zhihai xu commented on YARN-4761:
-

I just committed it to trunk, branch-2, branch-2.8, branch-2.7 and branch-2.6. 
Thanks [~sjlee0] for the contribution and thanks [~rohithsharma] for the review!

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations on fair scheduler
> 
>
> Key: YARN-4761
> URL: https://issues.apache.org/jira/browse/YARN-4761
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.4
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Fix For: 2.8.0, 2.7.3, 2.6.5
>
> Attachments: YARN-4761.01.patch, YARN-4761.02.patch
>
>
> YARN-3802 uncovered an issue with the scheduler where the resource 
> calculation can be incorrect due to async event handling. It was subsequently 
> fixed by YARN-4344, but it was never fixed for the fair scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException

2016-04-19 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248637#comment-15248637
 ] 

zhihai xu commented on YARN-2910:
-

Linked YARN-2975 to this issue. It looks like we need both YARN-2910 and 
YARN-2975 to fix this issue completely.

> FSLeafQueue can throw ConcurrentModificationException
> -
>
> Key: YARN-2910
> URL: https://issues.apache.org/jira/browse/YARN-2910
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0, 2.6.1
>
> Attachments: FSLeafQueue_concurrent_exception.txt, 
> YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, 
> YARN-2910.4.patch, YARN-2910.5.patch, YARN-2910.6.patch, YARN-2910.7.patch, 
> YARN-2910.8.patch, YARN-2910.patch
>
>
> The lists that maintain the runnable and the non-runnable apps are standard 
> ArrayLists, but there is no guarantee that they will only be manipulated by 
> one thread in the system. This can lead to the following exception:
> {noformat}
> 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM.
> java.util.ConcurrentModificationException: 
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
> at java.util.ArrayList$Itr.next(ArrayList.java:831)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)
> {noformat}
> Full stack trace in the attached file.
> We should guard against that by using a thread-safe alternative such as 
> java.util.concurrent.CopyOnWriteArrayList.
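
As a brief editorial illustration (a minimal, self-contained sketch with 
made-up entries, not the attached patch), iterating a CopyOnWriteArrayList 
tolerates concurrent-style mutation, whereas a plain ArrayList iterator would 
fail fast here:
{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CopyOnWriteDemo {
  public static void main(String[] args) {
    List<String> apps = new CopyOnWriteArrayList<>(Arrays.asList("app1", "app2"));
    for (String app : apps) {
      // Mutating during iteration: the iterator keeps reading its snapshot, so no
      // ConcurrentModificationException is thrown (an ArrayList would throw here).
      apps.add(app + "-clone");
    }
    System.out.println(apps); // [app1, app2, app1-clone, app2-clone]
  }
}
{code}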



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:

Attachment: YARN-1458.addendum.patch

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor was blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251016#comment-15251016
 ] 

zhihai xu commented on YARN-1458:
-

Hi [~dwatzke], thanks for reporting this issue. I double-checked the code and 
found one corner case which can cause it; hopefully this is the only case that 
isn't handled.
The corner case is when the app's current memory demand overflows the integer 
range. If that happens, the weight becomes 
[NaN|https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#log1p(double)]
 because the current memory demand is a negative value.
{code}
weight = Math.log1p(app.getDemand().getMemory()) / Math.log(2);
{code}
{{getFairShareIfFixed}} treats a NaN weight the same as a positive weight.
{{computeShare}} always returns 0 if the weight is NaN, because {{share}} is 
NaN and {{(int) NaN}} is 0.
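To make the failure mode concrete, here is a small self-contained sketch (with 
made-up numbers, purely illustrative) of how an overflowed demand turns the 
weight into NaN and the computed share into 0:
{code}
public class NaNWeightDemo {
  public static void main(String[] args) {
    int demand = Integer.MAX_VALUE;
    demand += 1024;                                    // integer overflow: demand is now negative
    double weight = Math.log1p(demand) / Math.log(2);  // Math.log1p(x) is NaN for x < -1
    double share = 4096.0 / weight;                    // any arithmetic involving NaN stays NaN
    System.out.println((int) share);                   // prints 0, so the computed share is always 0
  }
}
{code}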
I attached an addendum patch, YARN-1458.addendum.patch. Could you verify 
whether it fixes your issue?
Thanks



> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor was blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.

2016-04-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-4979:
---

 Summary: FSAppAttempt adds duplicate ResourceRequest to demand in 
updateDemand.
 Key: YARN-4979
 URL: https://issues.apache.org/jira/browse/YARN-4979
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.2, 2.8.0
Reporter: zhihai xu
Assignee: zhihai xu


FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. We 
should only count the ResourceRequest.ANY requests when calculating demand, 
because {{hasContainerForNode}} returns false if there is no container request 
for ResourceRequest.ANY, and both {{allocateNodeLocal}} and 
{{allocateRackLocal}} also decrease the number of containers for 
ResourceRequest.ANY.
This issue may cause the current memory demand to overflow (integer) because 
duplicate requests can exist on multiple nodes.
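As a rough illustration only (a simplified sketch, not the attached patch; the 
helper {{getAllResourceRequests()}} is assumed to return every outstanding 
request of the attempt), demand accumulation would only look at the ANY 
entries:
{code}
// Simplified sketch: count each ask once via its ResourceRequest.ANY entry and skip
// the node-local and rack-local duplicates of the same ask.
Resource demand = Resource.newInstance(0, 0);
for (ResourceRequest rr : getAllResourceRequests()) {   // hypothetical helper
  if (ResourceRequest.ANY.equals(rr.getResourceName())) {
    Resources.multiplyAndAddTo(demand, rr.getCapability(), rr.getNumContainers());
  }
}
{code}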



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.

2016-04-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4979:

Attachment: YARN-4979.001.patch

> FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
> --
>
> Key: YARN-4979
> URL: https://issues.apache.org/jira/browse/YARN-4979
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.0, 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4979.001.patch
>
>
> FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. 
> We should only count the ResourceRequest.ANY requests when calculating 
> demand, because {{hasContainerForNode}} returns false if there is no 
> container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} 
> and {{allocateRackLocal}} also decrease the number of containers for 
> ResourceRequest.ANY.
> This issue may cause the current memory demand to overflow (integer) because 
> duplicate requests can exist on multiple nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251356#comment-15251356
 ] 

zhihai xu commented on YARN-1458:
-

I think FSAppAttempt may add duplicate ResourceRequests to the demand, which 
can make the current memory demand overflow the integer range. I created 
YARN-4979 to fix the incorrect demand calculation in FSAppAttempt; it may be 
the root cause here.

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor was blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2016-04-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252087#comment-15252087
 ] 

zhihai xu commented on YARN-1458:
-

OK, no problem, you can try it at your convenience. Thanks for finding this 
issue!

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.addendum.patch, YARN-1458.alternative0.patch, 
> YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, 
> yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor was blocked when 
> clients submitted lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4979) FSAppAttempt demand calculation considers demands at multiple locality levels different

2016-05-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297200#comment-15297200
 ] 

zhihai xu commented on YARN-4979:
-

Thanks [~kasha] for reviewing and committing the patch!

> FSAppAttempt demand calculation considers demands at multiple locality levels 
> different
> ---
>
> Key: YARN-4979
> URL: https://issues.apache.org/jira/browse/YARN-4979
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.0, 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.9.0
>
> Attachments: YARN-4979.001.patch
>
>
> FSAppAttempt adds duplicate ResourceRequests to the demand in updateDemand. 
> We should only count the ResourceRequest.ANY requests when calculating 
> demand, because {{hasContainerForNode}} returns false if there is no 
> container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} 
> and {{allocateRackLocal}} also decrease the number of containers for 
> ResourceRequest.ANY.
> This issue may cause the current memory demand to overflow (integer) because 
> duplicate requests can exist on multiple nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.002.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs hang 
> forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746455#comment-14746455
 ] 

zhihai xu commented on YARN-3446:
-

Thanks [~kasha] for the reminder! I just uploaded a new patch 
YARN-3446.002.patch based on the latest code at trunk. Please review it.

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes blacklisted nodes. This makes jobs hang 
> forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown

2015-09-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805084#comment-14805084
 ] 

zhihai xu commented on YARN-3697:
-

I also cherry-picked the change from trunk to branch-2.7. Since YARN-4153 was 
committed to branch-2.7, TestAsyncDispatcher succeeds now.

> FairScheduler: ContinuousSchedulingThread can fail to shutdown
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: YARN-3697.000.patch, YARN-3697.001.patch
>
>
> FairScheduler: the ContinuousSchedulingThread sometimes can't be shut down 
> after stop.
> The reason is that the InterruptedException is swallowed in 
> continuousSchedulingAttempt:
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transit
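
The swallowed-interrupt pattern described above can be illustrated with a 
small, self-contained sketch (an editorial illustration only, not the 
YARN-3697 patch): restoring the interrupt status inside the broad catch lets 
the outer loop observe the shutdown request and exit.
{code}
public class InterruptAwareLoop {
  public static void main(String[] args) throws Exception {
    Thread worker = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            Thread.sleep(10);                        // stand-in for attemptScheduling(node)
          } catch (Throwable ex) {
            if (ex instanceof InterruptedException) {
              Thread.currentThread().interrupt();    // preserve the shutdown signal
            }
          }
        }
        System.out.println("scheduling thread exited cleanly");
      }
    });
    worker.start();
    Thread.sleep(100);
    worker.interrupt();                              // request shutdown, analogous to stopping the scheduler
    worker.join();
  }
}
{code}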

[jira] [Created] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)
zhihai xu created YARN-4187:
---

 Summary: Yarn Client uses local address instead RM address as 
token renewer in a secure cluster when HA is enabled.
 Key: YARN-4187
 URL: https://issues.apache.org/jira/browse/YARN-4187
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


Yarn Client uses the local address instead of the RM address as the token 
renewer in a secure cluster when HA is enabled. This causes HDFS token renewal 
to fail for renewer "nobody" if the rules from 
{{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
DelegationTokenIdentifier.
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Component/s: security
 resourcemanager

> Yarn Client uses local address instead RM address as token renewer in a 
> secure cluster when HA is enabled.
> --
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses the local address instead of the RM address as the token 
> renewer in a secure cluster when HA is enabled. This causes HDFS token 
> renewal to fail for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
> DelegationTokenIdentifier.
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
> at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
> 

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Attachment: YARN-4187.000.patch

> Yarn Client uses local address instead RM address as token renewer in a 
> secure cluster when HA is enabled.
> --
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses the local address instead of the RM address as the token 
> renewer in a secure cluster when HA is enabled. This causes HDFS token 
> renewal to fail for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
> DelegationTokenIdentifier.
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
> at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
> at sun.reflect.NativeMe

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Attachment: (was: YARN-4187.000.patch)

> Yarn Client uses local address instead RM address as token renewer in a 
> secure cluster when HA is enabled.
> --
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> Yarn Client uses the local address instead of the RM address as the token 
> renewer in a secure cluster when HA is enabled. This causes HDFS token 
> renewal to fail for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
> DelegationTokenIdentifier.
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
> at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Me

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Attachment: YARN-4187.000.patch

> Yarn Client uses local address instead RM address as token renewer in a 
> secure cluster when HA is enabled.
> --
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses the local address instead of the RM address as the token 
> renewer in a secure cluster when HA is enabled. This causes HDFS token 
> renewal to fail for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
> DelegationTokenIdentifier.
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
> at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
> at sun.reflect.NativeMe

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a secure 
cluster when HA is enabled. This will cause HDFS token renewal failure for 
renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude 
the client address in HDFS DelegationTokenIdentifier. The reason why the local 
address is returned is that, when HA is enabled, "yarn.resourcemanager.address" 
may not be set, so the RM address falls back to its default and the local 
hostname is substituted for it.
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Summary: Yarn Client uses local address instead RM address as token renewer 
in a secure cluster when RM HA is enabled.  (was: Yarn Client uses local 
address instead RM address as token renewer in a secure cluster when HA is 
enabled.)

> Yarn Client uses local address instead RM address as token renewer in a 
> secure cluster when RM HA is enabled.
> -
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled. This will cause HDFS token renewal 
> failure for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in HDFS 
> {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: when HA is enabled, 
> "yarn.resourcemanager.address" may not be set, so the default address 
> "0.0.0.0:8032" will be used. Based on the following code in KerberosUtil, 
> the local address will be used to replace "0.0.0.0".
> {code}
>   public static final String getServicePrincipal(String service, String 
> hostname)
>   throws UnknownHostException {
> String fqdn = hostname;
> if (null == fqdn || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
>   fqdn = getLocalHostName();
> }
> // convert hostname to lowercase as kerberos does not work with hostnames
> // with uppercase characters.
> return service + "/" + fqdn.toLowerCase(Locale.US);
>   }
>   /* Return fqdn of the current host */
>   static String getLocalHostName() throws UnknownHostException {
> return InetAddress.getLocalHost().getCanonicalHostName();
>   }
> {code}
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Ser

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a secure 
cluster when RM HA is enabled. This will cause HDFS token renewal failure for 
renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude 
the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set, so the default address 
"0.0.0.0:8032" will be used. Based on the following code in KerberosUtil, the 
local address will be used to replace "0.0.0.0".
{code}
  public static final String getServicePrincipal(String service, String 
hostname)
  throws UnknownHostException {
String fqdn = hostname;
if (null == fqdn || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
// convert hostname to lowercase as kerberos does not work with hostnames
// with uppercase characters.
return service + "/" + fqdn.toLowerCase(Locale.US);
  }
  /* Return fqdn of the current host */
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a secure 
cluster when RM HA is enabled. This will cause HDFS token renewal failure for 
renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude 
the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set, so the default address 
"0.0.0.0:8032" will be used. Based on the following code in KerberosUtil.java, 
the local address will be used to replace "0.0.0.0".
{code}
  public static final String getServicePrincipal(String service, String 
hostname)
  throws UnknownHostException {
String fqdn = hostname;
if (null == fqdn || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
// convert hostname to lowercase as kerberos does not work with hostnames
// with uppercase characters.
return service + "/" + fqdn.toLowerCase(Locale.US);
  }
  /* Return fqdn of the current host */
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapr

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a secure 
cluster when RM HA is enabled. This will cause HDFS token renewal failure for 
renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude 
the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set, so the default address 
"0.0.0.0:8032" will be used. Based on the following code in SecurityUtil.java, 
the local address will be used to replace "0.0.0.0".
{code}
  private static String replacePattern(String[] components, String hostname)
  throws IOException {
String fqdn = hostname;
if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
components[2];
  }
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
  public static String getServerPrincipal(String principalConfig,
  InetAddress addr) throws IOException {
String[] components = getComponents(principalConfig);
if (components == null || components.length != 3
|| !components[1].equals(HOSTNAME_PATTERN)) {
  return principalConfig;
} else {
  if (addr == null) {
throw new IOException("Can't replace " + HOSTNAME_PATTERN
+ " pattern since client address is null");
  }
  return replacePattern(components, addr.getCanonicalHostName());
}
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a secure 
cluster when RM HA is enabled. This will cause HDFS token renewal failure for 
renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude 
the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set and 
{{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", so 
the default address "0.0.0.0:8032" will be used. Based on the following code 
in SecurityUtil.java, the local address will be used to replace "0.0.0.0".
{code}
  private static String replacePattern(String[] components, String hostname)
  throws IOException {
String fqdn = hostname;
if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
components[2];
  }
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
  public static String getServerPrincipal(String principalConfig,
  InetAddress addr) throws IOException {
String[] components = getComponents(principalConfig);
if (components == null || components.length != 3
|| !components[1].equals(HOSTNAME_PATTERN)) {
  return principalConfig;
} else {
  if (addr == null) {
throw new IOException("Can't replace " + HOSTNAME_PATTERN
+ " pattern since client address is null");
  }
  return replacePattern(components, addr.getCanonicalHostName());
}
  }
{code}
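To make the effect of that fallback concrete, here is a small self-contained example (an illustration only, not code from Hadoop or from this issue; the class name and hostnames are made up) that mimics the replacePattern() logic above:
{code}
import java.net.InetAddress;
import java.util.Locale;

// Stand-alone demo of the "_HOST" substitution described above: when the RM
// address is the wildcard "0.0.0.0", the client's own hostname is used, so the
// renewer ends up naming the client machine instead of the RM.
public class HostPatternDemo {
  static String expand(String principal, String rmHost) throws Exception {
    String[] parts = principal.split("[/@]");   // e.g. {"yarn", "_HOST", "EXAMPLE.COM"}
    if (parts.length != 3 || !parts[1].equals("_HOST")) {
      return principal;                         // nothing to substitute
    }
    String fqdn = rmHost;
    if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
      // Same fallback as replacePattern(): the local host wins.
      fqdn = InetAddress.getLocalHost().getCanonicalHostName();
    }
    return parts[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + parts[2];
  }

  public static void main(String[] args) throws Exception {
    // Default RM address when HA is enabled and yarn.resourcemanager.address is unset:
    System.out.println(expand("yarn/_HOST@EXAMPLE.COM", "0.0.0.0"));
    // prints yarn/<client-fqdn>@EXAMPLE.COM
    // Address taken from the rm-id specific configuration (hostname made up):
    System.out.println(expand("yarn/_HOST@EXAMPLE.COM", "rm1.example.com"));
    // prints yarn/rm1.example.com@EXAMPLE.COM
  }
}
{code}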
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at or

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a 
secure cluster when RM HA is enabled. This will cause HDFS token renewal failure 
for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} 
exclude the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set and 
{{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", so 
the default address "0.0.0.0:8032" will be used. Based on the following code 
in SecurityUtil.java, the local address will be used to replace "0.0.0.0".
{code}
  private static String replacePattern(String[] components, String hostname)
  throws IOException {
String fqdn = hostname;
if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
components[2];
  }
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
  public static String getServerPrincipal(String principalConfig,
  InetAddress addr) throws IOException {
String[] components = getComponents(principalConfig);
if (components == null || components.length != 3
|| !components[1].equals(HOSTNAME_PATTERN)) {
  return principalConfig;
} else {
  if (addr == null) {
throw new IOException("Can't replace " + HOSTNAME_PATTERN
+ " pattern since client address is null");
  }
  return replacePattern(components, addr.getCanonicalHostName());
}
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Summary: Yarn Client uses local address instead of RM address as token 
renewer in a secure cluster when RM HA is enabled.  (was: Yarn Client uses 
local address instead RM address as token renewer in a secure cluster when RM 
HA is enabled.)

> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled.
> 
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled. This will cause HDFS token renewal 
> failure for renewer "nobody" if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in HDFS 
> {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: when HA is enabled, 
> "yarn.resourcemanager.address" may not be set and 
> {{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", so 
> the default address "0.0.0.0:8032" will be used. Based on the following code 
> in SecurityUtil.java, the local address will be used to replace "0.0.0.0".
> {code}
>   private static String replacePattern(String[] components, String hostname)
>   throws IOException {
> String fqdn = hostname;
> if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
>   fqdn = getLocalHostName();
> }
> return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
> components[2];
>   }
>   static String getLocalHostName() throws UnknownHostException {
> return InetAddress.getLocalHost().getCanonicalHostName();
>   }
>   public static String getServerPrincipal(String principalConfig,
>   InetAddress addr) throws IOException {
> String[] components = getComponents(principalConfig);
> if (components == null || components.length != 3
> || !components[1].equals(HOSTNAME_PATTERN)) {
>   return principalConfig;
> } else {
>   if (addr == null) {
> throw new IOException("Can't replace " + HOSTNAME_PATTERN
> + " pattern since client address is null");
>   }
>   return replacePattern(components, addr.getCanonicalHostName());
> }
>   }
> {code}
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientPro

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Summary: Yarn Client uses local address instead of RM address as token 
renewer in a secure cluster when RM HA is enabled.  (was: Yarn Client uses 
local client address instead of RM address as token renewer in a secure cluster 
when RM HA is enabled.)

> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled.
> 
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled. This will cause HDFS token renew 
> failure for renewer "nobody"  if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in HDFS 
> {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: when HA is enabled, 
> "yarn.resourcemanager.address" may not be set and 
> {{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", so 
> the default address "0.0.0.0:8032" will be used. Based on the following code 
> in SecurityUtil.java, the local address will be used to replace "0.0.0.0".
> {code}
>   private static String replacePattern(String[] components, String hostname)
>   throws IOException {
> String fqdn = hostname;
> if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
>   fqdn = getLocalHostName();
> }
> return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
> components[2];
>   }
>   static String getLocalHostName() throws UnknownHostException {
> return InetAddress.getLocalHost().getCanonicalHostName();
>   }
>   public static String getServerPrincipal(String principalConfig,
>   InetAddress addr) throws IOException {
> String[] components = getComponents(principalConfig);
> if (components == null || components.length != 3
> || !components[1].equals(HOSTNAME_PATTERN)) {
>   return principalConfig;
> } else {
>   if (addr == null) {
> throw new IOException("Can't replace " + HOSTNAME_PATTERN
> + " pattern since client address is null");
>   }
>   return replacePattern(components, addr.getCanonicalHostName());
> }
>   }
> {code}
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderP

[jira] [Updated] (YARN-4187) Yarn Client uses local client address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Summary: Yarn Client uses local client address instead of RM address as 
token renewer in a secure cluster when RM HA is enabled.  (was: Yarn Client 
uses local address instead of RM address as token renewer in a secure cluster 
when RM HA is enabled.)

> Yarn Client uses local client address instead of RM address as token renewer 
> in a secure cluster when RM HA is enabled.
> ---
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled. This will cause HDFS token renew 
> failure for renewer "nobody"  if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in HDFS 
> {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: when HA is enabled, 
> "yarn.resourcemanager.address" may not be set and 
> {{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", so 
> the default address "0.0.0.0:8032" will be used. Based on the following code 
> in SecurityUtil.java, the local address will be used to replace "0.0.0.0".
> {code}
>   private static String replacePattern(String[] components, String hostname)
>   throws IOException {
> String fqdn = hostname;
> if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
>   fqdn = getLocalHostName();
> }
> return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
> components[2];
>   }
>   static String getLocalHostName() throws UnknownHostException {
> return InetAddress.getLocalHost().getCanonicalHostName();
>   }
>   public static String getServerPrincipal(String principalConfig,
>   InetAddress addr) throws IOException {
> String[] components = getComponents(principalConfig);
> if (components == null || components.length != 3
> || !components[1].equals(HOSTNAME_PATTERN)) {
>   return principalConfig;
> } else {
>   if (addr == null) {
> throw new IOException("Can't replace " + HOSTNAME_PATTERN
> + " pattern since client address is null");
>   }
>   return replacePattern(components, addr.getCanonicalHostName());
> }
>   }
> {code}
> The following is the exception which causes the job to fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.Authoriz

[jira] [Updated] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4187:

Description: 
Yarn Client uses local address instead of RM address as token renewer in a 
secure cluster when RM HA is enabled. This will cause HDFS token renew failure 
for renewer "nobody"  if the rules from {{hadoop.security.auth_to_local}} 
exclude the client address in HDFS {{DelegationTokenIdentifier}}.
The reason why the local address is returned is: when HA is enabled, 
"yarn.resourcemanager.address" may not be set; if 
{{HOSTNAME_PATTERN}} ("_HOST") is used in "yarn.resourcemanager.principal", the 
default address "0.0.0.0:8032" will be used. Based on the following code in 
SecurityUtil.java, the local address will be used to replace "0.0.0.0".
{code}
  private static String replacePattern(String[] components, String hostname)
  throws IOException {
String fqdn = hostname;
if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
  fqdn = getLocalHostName();
}
return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
components[2];
  }
  static String getLocalHostName() throws UnknownHostException {
return InetAddress.getLocalHost().getCanonicalHostName();
  }
  public static String getServerPrincipal(String principalConfig,
  InetAddress addr) throws IOException {
String[] components = getComponents(principalConfig);
if (components == null || components.length != 3
|| !components[1].equals(HOSTNAME_PATTERN)) {
  return principalConfig;
} else {
  if (addr == null) {
throw new IOException("Can't replace " + HOSTNAME_PATTERN
+ " pattern since client address is null");
  }
  return replacePattern(components, addr.getCanonicalHostName());
}
  }
{code}
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at o

[jira] [Commented] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when RM HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876724#comment-14876724
 ] 

zhihai xu commented on YARN-4187:
-

I attached a patch, YARN-4187.000.patch, which uses YarnConfiguration instead of 
JobConf(Configuration) to call {{getSocketAddr}}, because 
Configuration.getSocketAddr doesn't support RM HA. We also need to check whether 
RM_HA_ID is configured: if RM_HA_ID is not configured, the first entry in 
RM_HA_IDS is used; otherwise it falls back to the client's local address.
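
For illustration only (this is a sketch, not the attached patch; the helper class 
and method names are hypothetical), the idea is roughly to resolve the RM address 
through a YarnConfiguration, whose {{getSocketAddr}} is RM-HA aware, before 
substituting the {{_HOST}} pattern in the RM principal:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmRenewerResolver {
  public static String getRenewer(Configuration conf) throws IOException {
    // Wrap the job configuration in a YarnConfiguration: its getSocketAddr is
    // RM-HA aware, whereas plain Configuration falls back to the 0.0.0.0
    // default and the local hostname ends up in the renewer principal.
    YarnConfiguration yarnConf = (conf instanceof YarnConfiguration)
        ? (YarnConfiguration) conf : new YarnConfiguration(conf);
    InetSocketAddress rmAddress = yarnConf.getSocketAddr(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_PORT);
    // Replace the _HOST pattern in yarn.resourcemanager.principal with the
    // resolved RM host instead of the client's local host.
    return SecurityUtil.getServerPrincipal(
        yarnConf.get(YarnConfiguration.RM_PRINCIPAL), rmAddress.getAddress());
  }
}
{code}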

> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled.
> 
>
> Key: YARN-4187
> URL: https://issues.apache.org/jira/browse/YARN-4187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4187.000.patch
>
>
> Yarn Client uses local address instead of RM address as token renewer in a 
> secure cluster when RM HA is enabled. This will cause HDFS token renew 
> failure for renewer "nobody"  if the rules from 
> {{hadoop.security.auth_to_local}} exclude the client address in HDFS 
> {{DelegationTokenIdentifier}}.
> The reason why the local address is returned is: When HA is enabled, 
> "yarn.resourcemanager.address" may not be set,  if 
> {{HOSTNAME_PATTERN}}("_HOST") is used in "yarn.resourcemanager.principal", 
> the default address "0.0.0.0:8032" will be used,  Based on the following code 
> at SecurityUtil.java, the local address will be used to replace "0.0.0.0".
> {code}
>   private static String replacePattern(String[] components, String hostname)
>   throws IOException {
> String fqdn = hostname;
> if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
>   fqdn = getLocalHostName();
> }
> return components[0] + "/" + fqdn.toLowerCase(Locale.US) + "@" + 
> components[2];
>   }
>   static String getLocalHostName() throws UnknownHostException {
> return InetAddress.getLocalHost().getCanonicalHostName();
>   }
>   public static String getServerPrincipal(String principalConfig,
>   InetAddress addr) throws IOException {
> String[] components = getComponents(principalConfig);
> if (components == null || components.length != 3
> || !components[1].equals(HOSTNAME_PATTERN)) {
>   return principalConfig;
> } else {
>   if (addr == null) {
> throw new IOException("Can't replace " + HOSTNAME_PATTERN
> + " pattern since client address is null");
>   }
>   return replacePattern(components, addr.getCanonicalHostName());
> }
>   }
> {code}
> The following is the exception which cause the job fail:
> {code}
> 15/09/12 16:27:24 WARN security.UserGroupInformation: 
> PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: Failed to run job : yarn tries to renew a token 
> with renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> java.io.IOException: Failed to run job : yarn tries to renew a token with 
> renewer nobody
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
> at 
> or

[jira] [Created] (YARN-4190) Add container information in FairScheduler preemption log to help debug.

2015-09-18 Thread zhihai xu (JIRA)
zhihai xu created YARN-4190:
---

 Summary: Add container information in FairScheduler preemption log 
to help debug.
 Key: YARN-4190
 URL: https://issues.apache.org/jira/browse/YARN-4190
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


Add container information in FairScheduler preemption log to help debug. 
Currently the following log doesn't have container information
{code}
LOG.info("Preempting container (prio=" + container.getContainer().getPriority() 
+
"res=" + container.getContainer().getResource() +
") from queue " + queue.getName());
{code}
So it is very difficult to debug preemption-related issues for the 
FairScheduler.
Even though the container information is printed in the following code,
{code}
LOG.info("Killing container" + container +
" (after waiting for premption for " +
(getClock().getTime() - time) + "ms)");
{code}
we can't match these two logs based on the container ID, so it would be very 
useful to add the container information to the first log as well.
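
As a rough illustration of the proposed change (a sketch, not a committed patch), 
the first log could carry the container ID so it can be matched with the later 
"Killing container" log:
{code}
LOG.info("Preempting container " + container.getContainerId() +
    " (prio=" + container.getContainer().getPriority() +
    ", res=" + container.getContainer().getResource() +
    ") from queue " + queue.getName());
{code}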



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4190) missing container information in FairScheduler preemption log.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4190:

Summary: missing container information in FairScheduler preemption log.  
(was: Add container information in FairScheduler preemption log to help debug.)

> missing container information in FairScheduler preemption log.
> --
>
> Key: YARN-4190
> URL: https://issues.apache.org/jira/browse/YARN-4190
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
>
> Add container information in FairScheduler preemption log to help debug. 
> Currently the following log doesn't have container information
> {code}
> LOG.info("Preempting container (prio=" + 
> container.getContainer().getPriority() +
> "res=" + container.getContainer().getResource() +
> ") from queue " + queue.getName());
> {code}
> So it will be very difficult to debug preemption related issue for 
> FairScheduler.
> Even the container information is printed in the following code
> {code}
> LOG.info("Killing container" + container +
> " (after waiting for premption for " +
> (getClock().getTime() - time) + "ms)");
> {code}
> But we can't match these two logs based on the container ID.
> It will be very useful to add container information in the first log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4190) missing container information in FairScheduler preemption log.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-4190.
-
Resolution: Later

> missing container information in FairScheduler preemption log.
> --
>
> Key: YARN-4190
> URL: https://issues.apache.org/jira/browse/YARN-4190
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
>
> Add container information in FairScheduler preemption log to help debug. 
> Currently the following log doesn't have container information
> {code}
> LOG.info("Preempting container (prio=" + 
> container.getContainer().getPriority() +
> "res=" + container.getContainer().getResource() +
> ") from queue " + queue.getName());
> {code}
> So it will be very difficult to debug preemption related issue for 
> FairScheduler.
> Even the container information is printed in the following code
> {code}
> LOG.info("Killing container" + container +
> " (after waiting for premption for " +
> (getClock().getTime() - time) + "ms)");
> {code}
> But we can't match these two logs based on the container ID.
> It will be very useful to add container information in the first log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4190) missing container information in FairScheduler preemption log.

2015-09-18 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876772#comment-14876772
 ] 

zhihai xu commented on YARN-4190:
-

Thanks [~xinxianyin] for the information! Yes, good suggestion. I'll resolve it 
as fix-later and make it depend on YARN-4134.

> missing container information in FairScheduler preemption log.
> --
>
> Key: YARN-4190
> URL: https://issues.apache.org/jira/browse/YARN-4190
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
>
> Add container information in FairScheduler preemption log to help debug. 
> Currently the following log doesn't have container information
> {code}
> LOG.info("Preempting container (prio=" + 
> container.getContainer().getPriority() +
> "res=" + container.getContainer().getResource() +
> ") from queue " + queue.getName());
> {code}
> So it will be very difficult to debug preemption related issue for 
> FairScheduler.
> Even the container information is printed in the following code
> {code}
> LOG.info("Killing container" + container +
> " (after waiting for premption for " +
> (getClock().getTime() - time) + "ms)");
> {code}
> But we can't match these two logs based on the container ID.
> It will be very useful to add container information in the first log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.003.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because for Reducer preemption 
> calculation, headRoom is considering blacklisted nodes. This makes jobs to 
> hang forever(ResourceManager does not assign any new containers on 
> blacklisted nodes but availableResource AM get from RM includes blacklisted 
> nodes available resource).
> This issue is similar as YARN-1680 which is for Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900067#comment-14900067
 ] 

zhihai xu commented on YARN-3446:
-

Hi [~kasha], Thanks for the review! I attached a new patch, YARN-3446.003.patch, 
which addresses your first comment. I also added more test cases to verify 
{{getHeadroom}} with blacklisted-node removal and addition.
About your second comment: IMHO, if we don't do the optimization, the overhead 
will be very large for a big cluster. For example, with 2000 AMs running on a 
5000-node cluster, each AM would have to go through the 5000-node list to find 
its blacklisted {{SchedulerNode}}s in every heartbeat; with 2000 AMs that is 
10,000,000 iterations. Normally the number of blacklisted nodes is very small 
for each application, so iterating over the blacklisted nodes should not be a 
performance issue. Also, an AM doesn't change its blacklisted nodes frequently.
About your third comment: it is because {{SchedulerNode}}s are currently stored 
in {{AbstractYarnScheduler#nodes}} keyed by {{NodeId}}, while 
{{AppSchedulingInfo}} stores the blacklisted nodes as {{String}} node names or 
rack names. I can't find an easy way to translate a node name or rack name to a 
{{NodeId}}, so it looks like we would need to iterate through 
{{AbstractYarnScheduler#nodes}} to find the blacklisted {{SchedulerNode}}s if we 
used {{AppSchedulingInfo#getBlacklist}}. That means for a 5000-node cluster we 
would loop 5000 times, a big overhead. {{AbstractYarnScheduler#nodes}} is 
defined as follows:
{code}
  protected Map<NodeId, N> nodes = new ConcurrentHashMap<NodeId, N>();
{code}
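
As a rough sketch of the optimization being discussed (the tracked field names 
below are assumptions, not the actual patch), the headroom calculation only 
subtracts an incrementally maintained aggregate of the blacklisted nodes' 
resources instead of scanning all {{SchedulerNode}}s per heartbeat:
{code}
// blacklistedNodesResource is assumed to be updated on every blacklist
// addition/removal, so computing headroom stays O(1) per heartbeat.
Resource headroom = Resources.subtract(
    Resources.clone(availableClusterResource), blacklistedNodesResource);
// Clamp to zero so a heavily blacklisted app never sees negative headroom.
headroom = Resources.componentwiseMax(headroom, Resources.none());
{code}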

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because for Reducer preemption 
> calculation, headRoom is considering blacklisted nodes. This makes jobs to 
> hang forever(ResourceManager does not assign any new containers on 
> blacklisted nodes but availableResource AM get from RM includes blacklisted 
> nodes available resource).
> This issue is similar as YARN-1680 which is for Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: (was: YARN-3446.003.patch)

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because for Reducer preemption 
> calculation, headRoom is considering blacklisted nodes. This makes jobs to 
> hang forever(ResourceManager does not assign any new containers on 
> blacklisted nodes but availableResource AM get from RM includes blacklisted 
> nodes available resource).
> This issue is similar as YARN-1680 which is for Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3446:

Attachment: YARN-3446.003.patch

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, 
> YARN-3446.002.patch, YARN-3446.003.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt the reducers because for Reducer preemption 
> calculation, headRoom is considering blacklisted nodes. This makes jobs to 
> hang forever(ResourceManager does not assign any new containers on 
> blacklisted nodes but availableResource AM get from RM includes blacklisted 
> nodes available resource).
> This issue is similar as YARN-1680 which is for Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-20 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4095:

Attachment: YARN-4095.001.patch

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static 
> TreeMap with configuration name as key
> {code}
>   private static Map  contexts = 
>  new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same 
> {{Configuration}} object, but they will use the same {{AllocatorPerContext}} 
> object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value 
> is changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900285#comment-14900285
 ] 

zhihai xu commented on YARN-4095:
-

Hi [~Jason Lowe], Could you help review the patch? thanks

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static 
> TreeMap with configuration name as key
> {code}
>   private static Map  contexts = 
>  new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same 
> {{Configuration}} object, but they will use the same {{AllocatorPerContext}} 
> object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value 
> is changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900286#comment-14900286
 ] 

zhihai xu commented on YARN-4095:
-

Hi [~jlowe], Could you help review the patch? thanks

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static 
> TreeMap with configuration name as key
> {code}
>   private static Map  contexts = 
>  new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same 
> {{Configuration}} object, but they will use the same {{AllocatorPerContext}} 
> object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value 
> is changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901099#comment-14901099
 ] 

zhihai xu commented on YARN-4095:
-

The first patch put {{NM_GOOD_LOCAL_DIRS}} and {{NM_GOOD_LOG_DIRS}} in 
YarnConfiguration.java; the second patch moves them to 
LocalDirsHandlerService.java, since they are only used inside 
{{LocalDirsHandlerService}}.
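
For illustration, a minimal sketch of the approach (the constant's value and the 
surrounding field names are assumptions; only the constant's name comes from the 
patches): key the NodeManager-internal {{LocalDirAllocator}} on a private 
configuration name so it no longer shares the static {{AllocatorPerContext}} that 
{{ShuffleHandler}} uses for {{NM_LOCAL_DIRS}}:
{code}
// Private key, visible only to LocalDirsHandlerService.
public static final String NM_GOOD_LOCAL_DIRS =
    YarnConfiguration.NM_PREFIX + "good-local-dirs";

// Allocator keyed on the private name: it gets its own AllocatorPerContext.
private final LocalDirAllocator localDirsAllocator =
    new LocalDirAllocator(NM_GOOD_LOCAL_DIRS);

// When full/bad dirs are excluded, only the private key is rewritten, so
// ShuffleHandler's context for yarn.nodemanager.local-dirs is untouched.
conf.setStrings(NM_GOOD_LOCAL_DIRS, goodLocalDirs.toArray(new String[0]));
{code}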

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static 
> TreeMap with configuration name as key
> {code}
>   private static Map  contexts = 
>  new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same 
> {{Configuration}} object, but they will use the same {{AllocatorPerContext}} 
> object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value 
> is changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-09-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904722#comment-14904722
 ] 

zhihai xu commented on YARN-4095:
-

Thanks [~jlowe] for reviewing and committing the patch!

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between 
> ShuffleHandler and LocalDirsHandlerService.
> -
>
> Key: YARN-4095
> URL: https://issues.apache.org/jira/browse/YARN-4095
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
> Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share 
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for configuration 
> {{NM_LOCAL_DIRS}} because {{AllocatorPerContext}} are stored in a static 
> TreeMap with configuration name as key
> {code}
>   private static Map  contexts = 
>  new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even they don't use the same 
> {{Configuration}} object, but they will use the same {{AllocatorPerContext}} 
> object. Also {{LocalDirsHandlerService}} may change {{NM_LOCAL_DIRS}} value 
> in its {{Configuration}} object to exclude full and bad local dirs, 
> {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
> {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} 
> is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, 
> {{AllocatorPerContext}} need be reinitialized because {{NM_LOCAL_DIRS}} value 
> is changed. This will cause some overhead.
> {code}
>   String newLocalDirs = conf.get(contextCfgItemName);
>   if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same 
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and 
> {{LocalDirsHandlerService}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909043#comment-14909043
 ] 

zhihai xu commented on YARN-3943:
-

I attached a new patch, YARN-3943.001.patch, for review. The new patch keeps 
backward compatibility by using the old configuration 
"yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" 
as the high-watermark threshold for disk-full detection and adding a new 
configuration 
"yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage"
 as the low-watermark threshold for disk-not-full detection. Both configurations 
use the same default value, and if the low-watermark threshold is set higher than 
the high-watermark threshold, it is clamped to the high-watermark value.
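
A small sketch of the resulting hysteresis (method and field names below are 
assumptions, not the patch itself): a disk is marked full only above the high 
watermark and is marked good again only once utilization drops below the low 
watermark, which prevents oscillation around a single threshold:
{code}
boolean isDiskUsageOverPercentageLimit(float utilizationPercent,
    boolean currentlyFull) {
  if (currentlyFull) {
    // Stay "full" until utilization falls below the low watermark.
    return utilizationPercent > diskUtilizationWatermarkLowPerDisk;
  }
  // Become "full" only once utilization exceeds the high watermark.
  return utilizationPercent > maxDiskUtilizationPerDiskPercentage;
}
{code}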

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-09-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3943:

Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and 
> disk-not-full detection.
> --
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
>
> Use separate threshold configurations to check when disks become full and 
> when disks become good. Currently the configuration 
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
>  and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are 
> used to check both when disks become full and when disks become good. It will 
> be better to use two configurations: one is used when disks become full from 
> not-full and the other one is used when disks become not-full from full. So 
> we can avoid oscillating frequently.
> For example: we can set the one for disk-full detection higher than the one 
> for disk-not-full detection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)
zhihai xu created YARN-4209:
---

 Summary: RMStateStore FENCED state doesn’t work
 Key: YARN-4209
 URL: https://issues.apache.org/jira/browse/YARN-4209
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


RMStateStore FENCED state doesn’t work. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
notifyStoreOperationFailed is called. The only working case for FENCED state is 
{{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Attachment: YARN-4209.000.patch

> RMStateStore FENCED state doesn’t work
> --
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Affects Version/s: (was: 2.7.1)

> RMStateStore FENCED state doesn’t work
> --
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Affects Version/s: 2.7.2

> RMStateStore FENCED state doesn’t work
> --
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910128#comment-14910128
 ] 

zhihai xu commented on YARN-4209:
-

I attached a patch, YARN-4209.000.patch, which moves {{updateFencedState}} from 
{{notifyStoreOperationFailed}} to {{StandByTransitionThread}}, so 
{{updateFencedState}} is no longer called from within {{stateMachine.doTransition}}.
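
Conceptually (a sketch only, reusing the thread name mentioned above rather than 
quoting the committed patch), the fence transition now runs outside the 
dispatcher's call stack:
{code}
// Runs on its own thread, so the FENCED state it writes is not overwritten
// when an outer, in-flight stateMachine.doTransition call returns and sets
// the state back to ACTIVE.
private class StandByTransitionThread extends Thread {
  @Override
  public void run() {
    updateFencedState();
    // ...then trigger the RM transition to standby...
  }
}
{code}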

> RMStateStore FENCED state doesn’t work
> --
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Summary: RMStateStore FENCED state doesn’t work due to updateFencedState 
called by stateMachine.doTransition  (was: RMStateStore FENCED state doesn’t 
work)

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Description: 
RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
{{stateMachine.doTransition}}. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
notifyStoreOperationFailed is called. The only working case for FENCED state is 
{{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.


  was:
RMStateStore FENCED state doesn’t work. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
notifyStoreOperationFailed is called. The only working case for FENCED state is 
{{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.



> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even notifyStoreOperationFailed is called. The only working case for 
> FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Description: 
RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
{{stateMachine.doTransition}}. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
{{notifyStoreOperationFailed}} is called. The only working case for FENCED 
state is {{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.


  was:
RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
{{stateMachine.doTransition}}. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
notifyStoreOperationFailed is called. The only working case for FENCED state is 
{{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.



> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even {{notifyStoreOperationFailed}} is called. The only working case 
> for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910141#comment-14910141
 ] 

zhihai xu commented on YARN-4209:
-

The test case in the patch also verifies this issue: without the change, the 
RMStateStore remains in the ACTIVE state even after {{updateFencedState}} is 
called.

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even {{notifyStoreOperationFailed}} is called. The only working case 
> for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Description: 
RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
{{stateMachine.doTransition}}. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
after {{notifyStoreOperationFailed}} is called. The only working case for 
FENCED state is {{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.


  was:
RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
{{stateMachine.doTransition}}. The reason is
{{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in 
{{stateMachine.doTransition}} called from public 
API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED state, the external state transition changes the state back to 
ACTIVE state. The end result is that RMStateStore is still in ACTIVE state even 
{{notifyStoreOperationFailed}} is called. The only working case for FENCED 
state is {{notifyStoreOperationFailed}} called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} =>{{updateFencedState}}=>{{handleStoreEvent}}=> 
enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}} change state to FENCED => exit external 
{{stateMachine.doTransition}} change state to ACTIVE.



> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Attachment: (was: YARN-4209.000.patch)

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-4209:

Attachment: YARN-4209.000.patch

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934261#comment-14934261
 ] 

zhihai xu commented on YARN-4209:
-

Hi [~jianhe], could you help review the patch? I added a lock and an 
{{isFencedState}} check in {{StandByTransitionThread}} to make sure 
{{handleTransitionToStandBy}} and {{updateFencedState}} are only called once, 
avoiding any potential race condition.
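
A minimal sketch of that guard, with hypothetical names (the actual patch may structure 
it differently): a flag checked and set under a lock so that concurrent failure 
notifications fence the store at most once.

{code}
// Minimal sketch of an "only once" fencing guard; names are illustrative.
public class FencedOnceGuard {
  private final Object lock = new Object();
  private boolean fenced = false;

  // Returns true only for the caller that should perform the fencing work.
  boolean tryFence() {
    synchronized (lock) {
      if (fenced) {
        return false; // the store was already fenced by an earlier notification
      }
      fenced = true;
      return true;
    }
  }

  public static void main(String[] args) {
    FencedOnceGuard guard = new FencedOnceGuard();
    System.out.println(guard.tryFence()); // true: the first failure fences the store
    System.out.println(guard.tryFence()); // false: later failures are no-ops
  }
}
{code}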

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition

2015-09-28 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934695#comment-14934695
 ] 

zhihai xu commented on YARN-4209:
-

Thanks for the review, [~rohithsharma]! Yes, that is a good point: using 
MultipleArcTransition would be a better solution. I will implement a new patch 
based on it.
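
In Hadoop's state machine a MultipleArcTransition returns the post-state from its 
{{transition}} method. The sketch below mimics that contract in plain Java (it is not 
the {{org.apache.hadoop.yarn.state}} API): because the transition body itself selects 
the post-state, a failed store operation can end in FENCED and there is no fixed outer 
post-state left to clobber it.

{code}
// Minimal sketch of the MultipleArcTransition idea: the transition body picks
// the post-state, so a store failure can end in FENCED instead of being clobbered.
import java.util.function.Supplier;

enum StoreState { ACTIVE, FENCED }

class MultiArcStateMachine {
  private StoreState current = StoreState.ACTIVE;

  StoreState getCurrentState() { return current; }

  // The transition body decides the post-state; doTransition just records it.
  void doTransition(Supplier<StoreState> multiArcTransition) {
    current = multiArcTransition.get();
  }
}

public class MultiArcDemo {
  public static void main(String[] args) {
    MultiArcStateMachine sm = new MultiArcStateMachine();
    sm.doTransition(() -> {
      boolean storeOperationFailed = true; // e.g. the ZooKeeper write failed
      return storeOperationFailed ? StoreState.FENCED : StoreState.ACTIVE;
    });
    System.out.println(sm.getCurrentState()); // prints FENCED
  }
}
{code}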

> RMStateStore FENCED state doesn’t work due to updateFencedState called by 
> stateMachine.doTransition
> ---
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-4209.000.patch
>
>
> RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by 
> {{stateMachine.doTransition}}. The reason is
> {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded 
> in {{stateMachine.doTransition}} called from public 
> API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So 
> right after the internal state transition from {{updateFencedState}} changes 
> the state to FENCED state, the external state transition changes the state 
> back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE 
> state even after {{notifyStoreOperationFailed}} is called. The only working 
> case for FENCED state is {{notifyStoreOperationFailed}} called from 
> {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
> {{notifyStoreOperationFailed}} 
> =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal 
> {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} 
> change state to FENCED => exit external {{stateMachine.doTransition}} change 
> state to ACTIVE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

2015-09-29 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3727:

Attachment: YARN-3727.001.patch

> For better error recovery, check if the directory exists before using it for 
> localization.
> --
>
> Key: YARN-3727
> URL: https://issues.apache.org/jira/browse/YARN-3727
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for 
> localization.
> We saw the following localization failure happen due to existing cache 
> directories.
> {code}
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, 
> null }, Rename cannot overwrite non empty destination directory 
> //8/yarn/nm/usercache//filecache/21637
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar)
>  transitioned from DOWNLOADING to FAILED
> {code}
> The real cause for this failure may be disk failure, LevelDB operation 
> failure for {{startResourceLocalization}}/{{finishResourceLocalization}} or 
> others.
> I wonder whether we can add error recovery code to avoid the localization 
> failure by not using the existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, 
> Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after 
> the exception, the existing cache directory used by {{LocalizedResource}} 
> will be deleted.
> {code}
> try {
>  .
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
> files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after localization 
> failure,
> I think it will be better to check if the directory exists before using it 
> for localization to avoid the localization failure.
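
The proposal above can be sketched as follows (the directory layout and skip policy are 
illustrative assumptions, not the behavior of the attached patch): before handing a 
cache directory to the downloader, verify that it does not already exist, and skip any 
leftover directory so the final rename cannot hit a non-empty destination.

{code}
// Hedged sketch of the proposed existence check; paths and policy are illustrative.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalCacheDirCheck {

  // Return a destination directory under cacheRoot that does not exist yet.
  static Path pickUsableCacheDir(Path cacheRoot, long startingId) {
    long id = startingId;
    Path candidate = cacheRoot.resolve(Long.toString(id));
    while (Files.exists(candidate)) {
      // Leftover from an earlier failed localization: skip it and try the next id.
      id++;
      candidate = cacheRoot.resolve(Long.toString(id));
    }
    return candidate;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("filecache");
    Files.createDirectory(root.resolve("21637")); // simulate a leftover cache dir
    Path dest = pickUsableCacheDir(root, 21637);
    System.out.println("localizing into " + dest); // ends in .../21638
  }
}
{code}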



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

2015-09-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935634#comment-14935634
 ] 

zhihai xu commented on YARN-3727:
-

[~lichangleo], [~jlowe], thanks for the review! Yes, I uploaded a new patch, 
YARN-3727.001.patch, based on the latest code on trunk.

> For better error recovery, check if the directory exists before using it for 
> localization.
> --
>
> Key: YARN-3727
> URL: https://issues.apache.org/jira/browse/YARN-3727
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for 
> localization.
> We saw the following localization failure happen due to existing cache 
> directories.
> {code}
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, 
> null }, Rename cannot overwrite non empty destination directory 
> //8/yarn/nm/usercache//filecache/21637
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar)
>  transitioned from DOWNLOADING to FAILED
> {code}
> The real cause for this failure may be disk failure, LevelDB operation 
> failure for {{startResourceLocalization}}/{{finishResourceLocalization}} or 
> others.
> I wonder whether we can add error recovery code to avoid the localization 
> failure by not using the existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, 
> Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after 
> the exception, the existing cache directory used by {{LocalizedResource}} 
> will be deleted.
> {code}
> try {
>  .
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
> files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after localization 
> failure,
> I think it will be better to check if the directory exists before using it 
> for localization to avoid the localization failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

2015-09-29 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3727:

Attachment: YARN-3727.001.patch

> For better error recovery, check if the directory exists before using it for 
> localization.
> --
>
> Key: YARN-3727
> URL: https://issues.apache.org/jira/browse/YARN-3727
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for 
> localization.
> We saw the following localization failure happen due to existing cache 
> directories.
> {code}
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, 
> null }, Rename cannot overwrite non empty destination directory 
> //8/yarn/nm/usercache//filecache/21637
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar)
>  transitioned from DOWNLOADING to FAILED
> {code}
> The real cause for this failure may be disk failure, LevelDB operation 
> failure for {{startResourceLocalization}}/{{finishResourceLocalization}} or 
> others.
> I wonder whether we can add error recovery code to avoid the localization 
> failure by not using the existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, 
> Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after 
> the exception, the existing cache directory used by {{LocalizedResource}} 
> will be deleted.
> {code}
> try {
>  .
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
> files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after localization 
> failure,
> I think it will be better to check if the directory exists before using it 
> for localization to avoid the localization failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

2015-09-29 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3727:

Attachment: (was: YARN-3727.001.patch)

> For better error recovery, check if the directory exists before using it for 
> localization.
> --
>
> Key: YARN-3727
> URL: https://issues.apache.org/jira/browse/YARN-3727
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for 
> localization.
> We saw the following localization failure happen due to existing cache 
> directories.
> {code}
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, 
> null }, Rename cannot overwrite non empty destination directory 
> //8/yarn/nm/usercache//filecache/21637
> 2015-05-11 18:59:59,756 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar)
>  transitioned from DOWNLOADING to FAILED
> {code}
> The real cause for this failure may be disk failure, LevelDB operation 
> failure for {{startResourceLocalization}}/{{finishResourceLocalization}} or 
> others.
> I wonder whether we can add error recovery code to avoid the localization 
> failure by not using the existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, 
> Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after 
> the exception, the existing cache directory used by {{LocalizedResource}} 
> will be deleted.
> {code}
> try {
>  .
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
> files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after localization 
> failure,
> I think it will be better to check if the directory exists before using it 
> for localization to avoid the localization failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

