[jira] [Updated] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped

2014-07-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2328:
---

Attachment: yarn-2328-2.patch

Thanks Sandy. Removed the unrelated change. 

Will commit this if Jenkins is fine. 

> FairScheduler: Verify update and continuous scheduling threads are stopped 
> when the scheduler is stopped
> 
>
> Key: YARN-2328
> URL: https://issues.apache.org/jira/browse/YARN-2328
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Minor
> Attachments: yarn-2328-1.patch, yarn-2328-2.patch
>
>
> FairScheduler threads can use a little cleanup and tests. To begin with, the 
> update and continuous-scheduling threads should extend Thread and handle 
> being interrupted. We should have tests for starting and stopping them as 
> well. 
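For illustration, here is a minimal sketch of the interruptible-thread pattern the description refers to (class and method names are made up for the example, not the actual FairScheduler internals):
{code}
// Minimal sketch of an interruptible background thread that a scheduler could
// own; "UpdateLoop" and "update()" are illustrative placeholders.
class UpdateLoop extends Thread {
  private final long updateIntervalMs;

  UpdateLoop(long updateIntervalMs) {
    this.updateIntervalMs = updateIntervalMs;
    setName("UpdateLoop");
    setDaemon(true);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(updateIntervalMs);
        update();
      } catch (InterruptedException ie) {
        // Re-set the flag and fall out of the loop when the scheduler stops us.
        Thread.currentThread().interrupt();
      }
    }
  }

  private void update() {
    // recompute shares, trigger preemption checks, etc.
  }
}
{code}
A serviceStop() implementation would then interrupt() and join() such a thread, and the proposed tests could assert that the thread is no longer alive once the scheduler is stopped.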





[jira] [Created] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)
zhihai xu created YARN-2337:
---

 Summary: remove duplication function call (setClientRMService) in 
resource manage class
 Key: YARN-2337
 URL: https://issues.apache.org/jira/browse/YARN-2337
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: zhihai xu
Priority: Minor


Remove the duplicate function call (setClientRMService) in the ResourceManager class.
rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
ResourceManager. 





[jira] [Assigned] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reassigned YARN-2337:
---

Assignee: zhihai xu

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
>
> Remove the duplicate function call (setClientRMService) in the ResourceManager 
> class.
> rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
> ResourceManager. 





[jira] [Updated] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2337:


Attachment: YARN-2337.000.patch

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2337.000.patch
>
>
> Remove the duplicate function call (setClientRMService) in the ResourceManager 
> class.
> rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
> ResourceManager. 





[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071450#comment-14071450
 ] 

zhihai xu commented on YARN-2337:
-

It is not necessary to call rmContext.setClientRMService(clientRM); twice in 
the following code:
  rmContext.setClientRMService(clientRM);
  addService(clientRM);
  rmContext.setClientRMService(clientRM);
The first call is removed in the patch.
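For reference, a minimal sketch of the relevant serviceInit excerpt once the redundant call is dropped (heavily elided; treat this as an illustration rather than the exact patch):
{code}
// Illustrative excerpt only; the real serviceInit() does much more.
clientRM = createClientRMService();
addService(clientRM);
rmContext.setClientRMService(clientRM);  // the single remaining call
{code}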

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2337.000.patch
>
>
> Remove the duplicate function call (setClientRMService) in the ResourceManager 
> class.
> rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
> ResourceManager. 





[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071470#comment-14071470
 ] 

Hadoop QA commented on YARN-2284:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657289/YARN2284-03.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common:

  org.apache.hadoop.ipc.TestIPC

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4399//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4399//console

This message is automatically generated.

> Find missing config options in YarnConfiguration and yarn-default.xml
> -
>
> Key: YARN-2284
> URL: https://issues.apache.org/jira/browse/YARN-2284
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>Priority: Minor
>  Labels: supportability
> Attachments: YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch
>
>
> YarnConfiguration has one set of properties.  yarn-default.xml has another 
> set of properties.  Ideally, there should be an automatic way to find missing 
> properties in either location.
> This is analogous to MAPREDUCE-5130, but for yarn-default.xml.
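One way such a cross-check can be automated is sketched below (a rough illustration, not the attached patch; it assumes yarn-default.xml is on the classpath and that the YarnConfiguration property constants are public static String fields whose values start with "yarn."):
{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FindMissingYarnProperties {
  public static void main(String[] args) throws Exception {
    // Properties declared in yarn-default.xml.
    Configuration defaults = new Configuration(false);
    defaults.addResource("yarn-default.xml");
    Set<String> xmlKeys = new HashSet<String>();
    for (Map.Entry<String, String> e : defaults) {
      xmlKeys.add(e.getKey());
    }

    // Property names referenced by constants in YarnConfiguration.
    Set<String> classKeys = new HashSet<String>();
    for (Field f : YarnConfiguration.class.getFields()) {
      if (Modifier.isStatic(f.getModifiers()) && f.getType() == String.class) {
        String value = (String) f.get(null);
        if (value != null && value.startsWith("yarn.")) {
          classKeys.add(value);
        }
      }
    }

    // Report the differences in both directions.
    Set<String> onlyInClass = new HashSet<String>(classKeys);
    onlyInClass.removeAll(xmlKeys);
    Set<String> onlyInXml = new HashSet<String>(xmlKeys);
    onlyInXml.removeAll(classKeys);
    System.out.println("Missing from yarn-default.xml: " + onlyInClass);
    System.out.println("Missing from YarnConfiguration: " + onlyInXml);
  }
}
{code}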





[jira] [Created] (YARN-2338) service assemble so complex

2014-07-23 Thread tangjunjie (JIRA)
tangjunjie created YARN-2338:


 Summary: service assemble so complex
 Key: YARN-2338
 URL: https://issues.apache.org/jira/browse/YARN-2338
 Project: Hadoop YARN
  Issue Type: Wish
Reporter: tangjunjie


  See ResourceManager:
protected void serviceInit(Configuration configuration) throws Exception 

So many services are assembled into the ResourceManager there.

Use Guice or another service-assembly (DI) framework to refactor this complex code.





[jira] [Created] (YARN-2339) service assemble so complex

2014-07-23 Thread tangjunjie (JIRA)
tangjunjie created YARN-2339:


 Summary: service assemble so complex
 Key: YARN-2339
 URL: https://issues.apache.org/jira/browse/YARN-2339
 Project: Hadoop YARN
  Issue Type: Wish
Reporter: tangjunjie


  See ResourceManager:
protected void serviceInit(Configuration configuration) throws Exception 

So many services are assembled into the ResourceManager there.

Use Guice or another service-assembly (DI) framework to refactor this complex code.





[jira] [Resolved] (YARN-2339) service assemble so complex

2014-07-23 Thread tangjunjie (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tangjunjie resolved YARN-2339.
--

Resolution: Duplicate

> service assemble so complex
> ---
>
> Key: YARN-2339
> URL: https://issues.apache.org/jira/browse/YARN-2339
> Project: Hadoop YARN
>  Issue Type: Wish
>Reporter: tangjunjie
>
>   See ResourceManager:
> protected void serviceInit(Configuration configuration) throws Exception 
> So many services are assembled into the ResourceManager there.
> Use Guice or another service-assembly (DI) framework to refactor this complex code.





[jira] [Commented] (YARN-2338) service assemble so complex

2014-07-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071495#comment-14071495
 ] 

Tsuyoshi OZAWA commented on YARN-2338:
--

Hi, do you mean that we should use a DI framework? What kind of refactoring are 
you planning to do?

> service assemble so complex
> ---
>
> Key: YARN-2338
> URL: https://issues.apache.org/jira/browse/YARN-2338
> Project: Hadoop YARN
>  Issue Type: Wish
>Reporter: tangjunjie
>
>   See ResourceManager:
> protected void serviceInit(Configuration configuration) throws Exception 
> So many services are assembled into the ResourceManager there.
> Use Guice or another service-assembly (DI) framework to refactor this complex code.





[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071496#comment-14071496
 ] 

Tsuyoshi OZAWA commented on YARN-2337:
--

+1 (non-binding), let's wait for the result of Jenkins CI.

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2337.000.patch
>
>
> Remove the duplicate function call (setClientRMService) in the ResourceManager 
> class.
> rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
> ResourceManager. 





[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071503#comment-14071503
 ] 

Hadoop QA commented on YARN-2328:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657303/yarn-2328-2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4400//console

This message is automatically generated.

> FairScheduler: Verify update and continuous scheduling threads are stopped 
> when the scheduler is stopped
> 
>
> Key: YARN-2328
> URL: https://issues.apache.org/jira/browse/YARN-2328
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Minor
> Attachments: yarn-2328-1.patch, yarn-2328-2.patch
>
>
> FairScheduler threads can use a little cleanup and tests. To begin with, the 
> update and continuous-scheduling threads should extend Thread and handle 
> being interrupted. We should have tests for starting and stopping them as 
> well. 





[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071547#comment-14071547
 ] 

Karthik Kambatla commented on YARN-2313:


Sorry for coming in late here. Didn't see this before.

I think we need a better solution here. Otherwise, clusters will continue to 
run into this. 

One simple way to address this could be to wait {{updateInterval}} ms after 
finishing an iteration of the update thread before starting the next iteration. We 
should do something similar for the continuous-scheduling thread as well. 
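In loop form, that proposal amounts to something like the following sketch (update() and updateInterval stand in for the real thread body and configuration value, so this is illustrative rather than the actual UpdateThread code):
{code}
// Sketch: always pause for a full interval after each pass, no matter how
// long update() itself took.
while (!Thread.currentThread().isInterrupted()) {
  try {
    update();
    Thread.sleep(updateInterval);
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
  }
}
{code}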

> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}}, called by UpdateThread, takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.





[jira] [Updated] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped

2014-07-23 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2328:
---

Attachment: yarn-2328-2.patch

Updated patch on latest trunk. 

> FairScheduler: Verify update and continuous scheduling threads are stopped 
> when the scheduler is stopped
> 
>
> Key: YARN-2328
> URL: https://issues.apache.org/jira/browse/YARN-2328
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Minor
> Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch
>
>
> FairScheduler threads can use a little cleanup and tests. To begin with, the 
> update and continuous-scheduling threads should extend Thread and handle 
> being interrupted. We should have tests for starting and stopping them as 
> well. 





[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071558#comment-14071558
 ] 

Karthik Kambatla commented on YARN-2313:


Actually, thinking more about it, I don't quite understand how the 
update thread can go into a busy loop. Thread.sleep() and update() are called 
serially. So, irrespective of how long update() takes, the next Thread.sleep is 
called for 500 ms, no? 

It is possible that these 500 ms are not enough for other work and the 
scheduler lags, but it should still make progress. 

> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}}, called by UpdateThread, takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.





[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071566#comment-14071566
 ] 

Hadoop QA commented on YARN-2337:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657305/YARN-2337.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4401//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4401//console

This message is automatically generated.

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2337.000.patch
>
>
> Remove the duplicate function call (setClientRMService) in the ResourceManager 
> class.
> rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
> ResourceManager. 





[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071599#comment-14071599
 ] 

Hudson commented on YARN-2295:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2295. Refactored DistributedShell to use public APIs of protocol records. 
Contributed by Li Lu (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java


> Refactor YARN distributed shell with existing public stable API
> ---
>
> Key: YARN-2295
> URL: https://issues.apache.org/jira/browse/YARN-2295
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, 
> YARN-2295-071514.patch, YARN-2295-072114.patch
>
>
> Some API calls in YARN distributed shell have been marked as unstable and 
> private. Use existing public stable API to replace them, if possible. 





[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071597#comment-14071597
 ] 

Hudson commented on YARN-2313:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2313. Livelock can occur in FairScheduler when there are lots of running 
apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}}, called by UpdateThread, takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.





[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071611#comment-14071611
 ] 

Hudson commented on YARN-2242:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2242. Addendum patch. Improve exception information on AM launch crashes. 
(Contributed by Li Lu) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, 
> YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, 
> YARN-2242-071414.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the webpage UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated, and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 





[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071596#comment-14071596
 ] 

Hudson commented on YARN-2273:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


> NPE in ContinuousScheduling thread when we lose a node
> --
>
> Key: YARN-2273
> URL: https://issues.apache.org/jira/browse/YARN-2273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.3.0, 2.4.1
> Environment: cdh5.0.2 wheezy
>Reporter: Andy Skelton
>Assignee: Wei Yan
> Fix For: 2.6.0
>
> Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, 
> YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch
>
>
> One DN experienced memory errors and entered a cycle of rebooting and 
> rejoining the cluster. After the second time the node went away, the RM 
> produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1404858438119_4352_01 released container 
> container_1404858438119_4352_01_04 on node: host: 
> node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 
> available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
> 
> 2014-07-09 21:47:36,571 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>   at java.util.TimSort.sort(TimSort.java:203)
>   at java.util.TimSort.sort(TimSort.java:173)
>   at java.util.Arrays.sort(Arrays.java:659)
>   at java.util.Collections.sort(Collections.java:217)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be 
> submitted but containers were not assigned and no progress was made. 
> Restarting the RM resolved it.





[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071610#comment-14071610
 ] 

Hudson commented on YARN-2319:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2319. Made the MiniKdc instance start/close before/after the class of 
TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java


> Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
> ---
>
> Key: YARN-2319
> URL: https://issues.apache.org/jira/browse/YARN-2319
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: resourcemanager
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Wenwu Peng
>Assignee: Wenwu Peng
> Fix For: 2.5.0
>
> Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, 
> YARN-2319.2.patch
>
>
> MiniKdc only invokes the start method, not stop, in 
> TestRMWebServicesDelegationTokens.java:
> {code}
> testMiniKDC.start();
> {code}
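A common way to pair that start with a stop is JUnit class-level setup/teardown; a minimal sketch under that assumption (class and field names are illustrative, not the actual test):
{code}
import java.io.File;
import java.util.Properties;

import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class TestWithMiniKdc {
  private static MiniKdc testMiniKDC;

  @BeforeClass
  public static void setupKdc() throws Exception {
    Properties kdcConf = MiniKdc.createConf();
    File workDir = new File("target/test-kdc");
    workDir.mkdirs();
    testMiniKDC = new MiniKdc(kdcConf, workDir);
    testMiniKDC.start();
  }

  @AfterClass
  public static void shutdownKdc() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();  // ensure the KDC is torn down after the class
    }
  }
}
{code}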





[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071607#comment-14071607
 ] 

Hudson commented on YARN-2131:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/621/])
YARN-2131. Addendum2: Document -format-state-store. Add a way to format the 
RMStateStore. (Robert Kanter via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm


> Add a way to format the RMStateStore
> 
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Fix For: 2.6.0
>
> Attachments: YARN-2131.patch, YARN-2131.patch, 
> YARN-2131_addendum.patch, YARN-2131_addendum2.patch
>
>
> There are cases when we don't want to recover past applications, but recover 
> applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this and users should understand how each store 
> works.





[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071620#comment-14071620
 ] 

Tsuyoshi OZAWA commented on YARN-2313:
--

Hi Karthik, thank you for pointing it out.

{quote}
 So, irrespective of how long update() takes the next Thread.sleep is called 
for 500 ms, no?
{quote}

You're correct. The description "goes into a busy loop" is wrong. But there still 
remains a starvation problem:

1. {{FairScheduler#update()}} can take more than 10 sec (the default value of 
reloadIntervalMs) while holding the lock.
2. {{AllocationFileLoaderThread#onReload}} can take more than 500 ms (the default 
value of updateInterval) while holding the lock.
3. As a result, {{FairScheduler#update()}} and {{FairScheduler#onReload}} can 
always win the lock on the {{FairScheduler}} instance.
4. {{ResourceManager$SchedulerEventDispatcher}} can wait forever.

The problem we faced was that the cluster (note that it's a very busy cluster!) hung 
up even after killing the existing apps. I got the stack trace when we faced the 
problem. In our case, we could avoid the problem by setting the configuration 
value (updateInterval) larger. IIUC, that is because it gives 
ResourceManager$SchedulerEventDispatcher a margin in which to acquire the lock. 

As you mentioned, this fix is just a workaround. However, it's effective. A more 
fundamental fix would be to make updateInterval and reloadIntervalMs dynamic. Please 
correct me if I'm wrong. 
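For completeness, the workaround described above (a larger update interval) is just a configuration change; a hedged sketch, assuming the interval property introduced by this patch is named {{yarn.scheduler.fair.update-interval-ms}} (verify the exact name in FairSchedulerConfiguration for your release; in practice it would be set in yarn-site.xml rather than in code):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RaiseUpdateInterval {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Assumed property name; check FairSchedulerConfiguration for your release.
    conf.setLong("yarn.scheduler.fair.update-interval-ms", 2000L);
    System.out.println(conf.getLong("yarn.scheduler.fair.update-interval-ms", 500L));
  }
}
{code}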

> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}}, called by UpdateThread, takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.





[jira] [Updated] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2014-07-23 Thread Kenji Kikushima (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenji Kikushima updated YARN-2336:
--

Attachment: YARN-2336-2.patch

Fixed test failure.

> Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
> --
>
> Key: YARN-2336
> URL: https://issues.apache.org/jira/browse/YARN-2336
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
> Attachments: YARN-2336-2.patch, YARN-2336.patch
>
>
> When we have sub queues in Fair Scheduler, the REST API returns JSON with a 
> missing '[' bracket for childQueues.
> This issue was found by [~ajisakaa] at YARN-1050.





[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071636#comment-14071636
 ] 

Hadoop QA commented on YARN-2328:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657329/yarn-2328-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4402//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4402//console

This message is automatically generated.

> FairScheduler: Verify update and continuous scheduling threads are stopped 
> when the scheduler is stopped
> 
>
> Key: YARN-2328
> URL: https://issues.apache.org/jira/browse/YARN-2328
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Minor
> Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch
>
>
> FairScheduler threads can use a little cleanup and tests. To begin with, the 
> update and continuous-scheduling threads should extend Thread and handle 
> being interrupted. We should have tests for starting and stopping them as 
> well. 





[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071665#comment-14071665
 ] 

Hadoop QA commented on YARN-2336:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657339/YARN-2336-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4403//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4403//console

This message is automatically generated.

> Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
> --
>
> Key: YARN-2336
> URL: https://issues.apache.org/jira/browse/YARN-2336
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
> Attachments: YARN-2336-2.patch, YARN-2336.patch
>
>
> When we have sub queues in Fair Scheduler, the REST API returns JSON with a 
> missing '[' bracket for childQueues.
> This issue was found by [~ajisakaa] at YARN-1050.





[jira] [Commented] (YARN-2301) Improve yarn container command

2014-07-23 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071705#comment-14071705
 ] 

Sunil G commented on YARN-2301:
---

This will be a really useful enhancement.

I have a concern here. 
bq. yarn container -list <Application Attempt ID>

* With *-list*, the variable input from the user could be either an appId or an 
appAttemptId, and each sub option accepts only one of these types. This may 
confuse users about which sub option needs which ID. 
I feel maybe we can have a new command itself for listing application 
containers.
A suggestion is:
{noformat}
yarn container -list-appid <Application ID>
yarn container -list-appattemptid <Application Attempt ID>
{noformat}
OR
{noformat}
yarn application -list-containers <Application ID>
{noformat}

* I feel sequential checks with ConverterUtils.toApplicationId and 
ConverterUtils.toApplicationAttemptId have to be done to know whether the input is 
an appId or an appAttemptId.
So, redirecting to my point 1, if a separate command is there, maybe it can be 
handled in a better way from ApplicationCLI (rather than by handling specific 
types of exceptions).
Please share your thoughts.

> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
>
> While running the yarn container -list <Application Attempt ID> command, some 
> observations:
> 1) the scheme (e.g. http/https) before LOG-URL is missing
> 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to 
> print it in a time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe show "N/A"
> 4) May have an option to run as yarn container -list <Application ID> OR yarn 
> application -list-containers <Application ID> also.
> As the attempt Id is not shown on the console, this makes it easier for the user 
> to just copy the appId and run it, and it may also be useful for 
> container-preserving AM restart. 





[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071728#comment-14071728
 ] 

Hudson commented on YARN-2313:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2313. Livelock can occur in FairScheduler when there are lots of running 
apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}}, called by UpdateThread, takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.





[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071730#comment-14071730
 ] 

Hudson commented on YARN-2295:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2295. Refactored DistributedShell to use public APIs of protocol records. 
Contributed by Li Lu (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java


> Refactor YARN distributed shell with existing public stable API
> ---
>
> Key: YARN-2295
> URL: https://issues.apache.org/jira/browse/YARN-2295
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, 
> YARN-2295-071514.patch, YARN-2295-072114.patch
>
>
> Some API calls in YARN distributed shell have been marked as unstable and 
> private. Use existing public stable API to replace them, if possible. 





[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071742#comment-14071742
 ] 

Hudson commented on YARN-2242:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2242. Addendum patch. Improve exception information on AM launch crashes. 
(Contributed by Li Lu) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, 
> YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, 
> YARN-2242-071414.patch
>
>
> Currently, each time the AM container crashes during launch, both the console and 
> the webpage UI only report a ShellExitCodeException. This is not only unhelpful, 
> but sometimes confusing. With the help of the log aggregator, container logs are 
> actually aggregated, and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 





[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-07-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071726#comment-14071726
 ] 

Jason Lowe commented on YARN-2314:
--

I suppose we could use a wait timeout.  I was just matching the behavior when 
it tries to refresh the NM token on an in-use proxy which also waits 
indefinitely.  What's the proposed behavior when the timeout expires?  Log a 
message and then...?  Arguably the timeouts should be on the RPC calls rather 
than the proxy cache, since I'm assuming if we're not willing to wait forever 
for a proxy to be freed up we're also not willing to wait forever for a remote 
call to complete.
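For what it's worth, a bounded wait of the kind being discussed usually looks something like the following sketch (generic placeholder names, not the ContainerManagementProtocolProxy code; the behavior on expiry here is simply to throw):
{code}
// Generic bounded-wait sketch: give up after timeoutMs instead of waiting forever.
// slotAvailable() is a placeholder for the real "is a proxy slot free" condition.
private void waitForFreeProxySlot(Object lock, long timeoutMs)
    throws IOException, InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  synchronized (lock) {
    while (!slotAvailable()) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IOException("Timed out waiting for a free NM proxy slot");
      }
      lock.wait(remaining);
    }
  }
}
{code}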

> ContainerManagementProtocolProxy can create thousands of threads for a large 
> cluster
> 
>
> Key: YARN-2314
> URL: https://issues.apache.org/jira/browse/YARN-2314
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Priority: Critical
> Attachments: nmproxycachefix.prototype.patch
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
> this cache is configurable.  However the cache can grow far beyond the 
> configured size when running on a large cluster and blow AM address/container 
> limits.  More details in the first comment.





[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071738#comment-14071738
 ] 

Hudson commented on YARN-2131:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2131. Addendum2: Document -format-state-store. Add a way to format the 
RMStateStore. (Robert Kanter via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm


> Add a way to format the RMStateStore
> 
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Fix For: 2.6.0
>
> Attachments: YARN-2131.patch, YARN-2131.patch, 
> YARN-2131_addendum.patch, YARN-2131_addendum2.patch
>
>
> There are cases when we don't want to recover past applications, but recover 
> applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this and users should understand how each store 
> works.





[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071727#comment-14071727
 ] 

Hudson commented on YARN-2273:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


> NPE in ContinuousScheduling thread when we lose a node
> --
>
> Key: YARN-2273
> URL: https://issues.apache.org/jira/browse/YARN-2273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.3.0, 2.4.1
> Environment: cdh5.0.2 wheezy
>Reporter: Andy Skelton
>Assignee: Wei Yan
> Fix For: 2.6.0
>
> Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, 
> YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch
>
>
> One DN experienced memory errors and entered a cycle of rebooting and 
> rejoining the cluster. After the second time the node went away, the RM 
> produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1404858438119_4352_01 released container 
> container_1404858438119_4352_01_04 on node: host: 
> node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 
> available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
> 
> 2014-07-09 21:47:36,571 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>   at java.util.TimSort.sort(TimSort.java:203)
>   at java.util.TimSort.sort(TimSort.java:173)
>   at java.util.Arrays.sort(Arrays.java:659)
>   at java.util.Collections.sort(Collections.java:217)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be 
> submitted but containers were not assigned and no progress was made. 
> Restarting the RM resolved it.





[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071741#comment-14071741
 ] 

Hudson commented on YARN-2319:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/])
YARN-2319. Made the MiniKdc instance start/close before/after the class of 
TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java


> Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
> ---
>
> Key: YARN-2319
> URL: https://issues.apache.org/jira/browse/YARN-2319
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: resourcemanager
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Wenwu Peng
>Assignee: Wenwu Peng
> Fix For: 2.5.0
>
> Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, 
> YARN-2319.2.patch
>
>
> MiniKdc only invokes the start method, not stop, in 
> TestRMWebServicesDelegationTokens.java:
> {code}
> testMiniKDC.start();
> {code}





[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover

2014-07-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071772#comment-14071772
 ] 

Rohith commented on YARN-1779:
--

This is a critical issue for the work-preserving restart feature. The AM cannot 
connect to the new RM because the proxy object is cached and the token service is 
overwritten.
One approach to solve this is to clone the token object and add the token to 
UserGroupInformation. A sample is shown below:
{code}
for (Token token : UserGroupInformation.getCurrentUser().getTokens()) {
  if (token.getKind().equals(AMRMTokenIdentifier.KIND_NAME)) {
    Token specificToken = new Token(token);
    SecurityUtil.setTokenService(specificToken, resourceManagerAddress);
    UserGroupInformation.getCurrentUser().addToken(specificToken);
  }
}
{code}
Does it make sense?

> Handle AMRMTokens across RM failover
> 
>
> Key: YARN-1779
> URL: https://issues.apache.org/jira/browse/YARN-1779
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Priority: Blocker
>  Labels: ha
>
> Verify if AMRMTokens continue to work against RM failover. If not, we will 
> have to do something along the lines of YARN-986. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-07-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071778#comment-14071778
 ] 

Zhijie Shen commented on YARN-2301:
---

[~sunilg], thanks for your input. Here's my response.

bq. And  is only for one of the type named appId. May be it 
may confuse user also, like which sub option needs the . 

I don't worry too much about it, because we can update the usage block to show 
users how to use the opts correctly. When users make a mistake, they will be 
redirected to the usage output.

bq. I feel may be we can have a new command itself for listing application 
container.

I am inclined not to change the command, to keep backward compatibility.

bq. I feel sequential checks with ConverterUtils.toApplicationID and 
ConverterUtils.toApplicationAttemptId has to be done to know whether input is 
appId|appAttemptId.

We can use ConverterUtils.APPLICATION_PREFIX and 
ConverterUtils.APPLICATION_ATTEMPT_PREFIX to check the prefix of the given id 
to determine whether it is an app id or an app attempt id. We don't actually need 
to handle the exception.
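
A minimal sketch of that prefix check (assuming the ConverterUtils prefix constants mentioned above are accessible; the class and method names here are illustrative):
{code}
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ContainerListArgSketch {
  /** Classify the command-line argument by its prefix instead of parse-and-catch. */
  static boolean isApplicationId(String id) {
    if (id.startsWith(ConverterUtils.APPLICATION_ATTEMPT_PREFIX)) {
      return false;  // e.g. appattempt_1405540544844_0001_000001
    }
    if (id.startsWith(ConverterUtils.APPLICATION_PREFIX)) {
      return true;   // e.g. application_1405540544844_0001
    }
    throw new IllegalArgumentException(
        "Expected an application id or an application attempt id: " + id);
  }

  public static void main(String[] args) {
    System.out.println(isApplicationId(args[0])
        ? "list containers for all attempts of the application"
        : "list containers for the given attempt only");
  }
}
{code}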

> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
>
> While running yarn container -list  command, some 
> observations:
> 1) the scheme (e.g. http/https  ) before LOG-URL is missing
> 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to 
> print as time format.
> 3) finish-time is 0 if container is not yet finished. May be "N/A"
> 4) May have an option to run as yarn container -list  OR  yarn 
> application -list-containers  also.  
> As attempt Id is not shown on console, this is easier for user to just copy 
> the appId and run it, may  also be useful for container-preserving AM 
> restart. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1063) Winutils needs ability to create task as domain user

2014-07-23 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-1063:
---

Attachment: YARN-1063.5.patch

Patch .5 changes the environment block of the secure process to inherit the 
parent environment.

> Winutils needs ability to create task as domain user
> 
>
> Key: YARN-1063
> URL: https://issues.apache.org/jira/browse/YARN-1063
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
> Environment: Windows
>Reporter: Kyle Leckie
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
> YARN-1063.5.patch, YARN-1063.patch
>
>
> h1. Summary:
> Securing a Hadoop cluster requires constructing some form of security 
> boundary around the processes executed in YARN containers. Isolation based on 
> Windows user isolation seems most feasible. This approach is similar to the 
> approach taken by the existing LinuxContainerExecutor. The current patch to 
> winutils.exe adds the ability to create a process as a domain user. 
> h1. Alternative Methods considered:
> h2. Process rights limited by security token restriction:
> On Windows access decisions are made by examining the security token of a 
> process. It is possible to spawn a process with a restricted security token. 
> Any of the rights granted by SIDs of the default token may be restricted. It 
> is possible to see this in action by examining the security token of a 
> sandboxed process launched by a web browser. Typically the launched process 
> will have a fully restricted token and need to access machine resources 
> through a dedicated broker process that enforces a custom security policy. 
> This broker process mechanism would break compatibility with the typical 
> Hadoop container process. The Container process must be able to utilize 
> standard function calls for disk and network IO. I performed some work 
> looking at ways to ACL the local files to the specific process launched without 
> granting rights to other processes launched on the same machine but found 
> this to be an overly complex solution. 
> h2. Relying on APP containers:
> Recent versions of windows have the ability to launch processes within an 
> isolated container. Application containers are supported for execution of 
> WinRT based executables. This method was ruled out due to the lack of 
> official support for standard windows APIs. At some point in the future 
> windows may support functionality similar to BSD jails or Linux containers, 
> at that point support for containers should be added.
> h1. Create As User Feature Description:
> h2. Usage:
> A new sub command was added to the set of task commands. Here is the syntax:
> winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
> Some notes:
> * The username specified is in the format of "user@domain"
> * The machine executing this command must be joined to the domain of the user 
> specified
> * The domain controller must allow the account executing the command access 
> to the user information. For this join the account to the predefined group 
> labeled "Pre-Windows 2000 Compatible Access"
> * The account running the command must have several rights on the local 
> machine. These can be managed manually using secpol.msc: 
> ** "Act as part of the operating system" - SE_TCB_NAME
> ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME
> ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME
> * The launched process will not have rights to the desktop so will not be 
> able to display any information or create UI.
> * The launched process will have no network credentials. Any access of 
> network resources that requires domain authentication will fail.
> h2. Implementation:
> Winutils performs the following steps:
> # Enable the required privileges for the current process.
> # Register as a trusted process with the Local Security Authority (LSA).
> # Create a new logon for the user passed on the command line.
> # Load/Create a profile on the local machine for the new logon.
> # Create a new environment for the new logon.
> # Launch the new process in a job with the task name specified and using the 
> created logon.
> # Wait for the JOB to exit.
> h2. Future work:
> The following work was scoped out of this check in:
> * Support for non-domain users or machines that are not domain joined.
> * Support for privilege isolation by running the task launcher in a high 
> privilege service with access over an ACLed named pipe.
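
For illustration, an invocation following the syntax quoted above; the task name, account and command line are made-up values, and the account must satisfy the domain and privilege requirements listed in the description:
{code}
winutils task createAsUser task_001 appuser@EXAMPLEDOMAIN "cmd /c whoami"
{code}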



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor

2014-07-23 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-1972:
---

Attachment: YARN-1972.3.patch

Patch .3 reverts the separation of createUserAppCacheDirs, as per review 
comment, and ads 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm


> Implement secure Windows Container Executor
> ---
>
> Key: YARN-1972
> URL: https://issues.apache.org/jira/browse/YARN-1972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch
>
>
> h1. Windows Secure Container Executor (WCE)
> YARN-1063 adds the necessary infrastructure to launch a process as a domain 
> user as a solution for the problem of having a security boundary between 
> processes executed in YARN containers and the Hadoop services. The WCE is a 
> container executor that leverages the winutils capabilities introduced in 
> YARN-1063 and launches containers as an OS process running as the job 
> submitter user. A description of the S4U infrastructure used by YARN-1063 
> alternatives considered can be read on that JIRA.
> The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
> drive the flow of execution, but it overrides some methods to the effect of:
> * change the DCE created user cache directories to be owned by the job user 
> and by the nodemanager group.
> * changes the actual container run command to use the 'createAsUser' command 
> of winutils task instead of 'create'
> * runs the localization as standalone process instead of an in-process Java 
> method call. This in turn relies on the winutil createAsUser feature to run 
> the localization as the job user.
>  
> When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
> differences:
> * it does not delegate the creation of the user cache directories to the 
> native implementation.
> * it does not require special handling to be able to delete user files
> The approach on the WCE came from a practical trial-and-error approach. I had 
> to iron out some issues around the Windows script shell limitations (command 
> line length) to get it to work, the biggest issue being the huge CLASSPATH 
> that is commonplace in Hadoop environment container executions. The job 
> container itself is already dealing with this via a so called 'classpath 
> jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch 
> as a separate container the same issue had to be resolved and I used the same 
> 'classpath jar' approach.
> h2. Deployment Requirements
> To use the WCE one needs to set the 
> `yarn.nodemanager.container-executor.class` to 
> `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
> and set the `yarn.nodemanager.windows-secure-container-executor.group` to a 
> Windows security group name that the nodemanager service principal is a 
> member of (equivalent of the LCE 
> `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
> does not require any configuration outside of Hadoop's own yarn-site.xml.
> For WCE to work the nodemanager must run as a service principal that is 
> member of the local Administrators group or LocalSystem. This is derived from 
> the need to invoke the LoadUserProfile API, which mentions these requirements in 
> the specifications. This is in addition to the SE_TCB privilege mentioned in 
> YARN-1063, but this requirement will automatically imply that the SE_TCB 
> privilege is held by the nodemanager. For the Linux speakers in the audience, 
> the requirement is basically to run NM as root.
> h2. Dedicated high privilege Service
> Due to the high privilege required by the WCE we had discussed the need to 
> isolate the high privilege operations into a separate process, an 'executor' 
> service that is solely responsible for starting the containers (including the 
> localizer). The NM would have to authenticate, authorize and communicate with 
> this service via an IPC mechanism and use this service to launch the 
> containers. I still believe we'll end up deploying such a service, but the 
> effort to onboard such a new platform-specific service on the project is 
> not trivial.
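
A minimal yarn-site.xml sketch of the two properties described above; the group name nodemanager-admins is an assumed example, not a required value:
{code}
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <!-- A Windows security group the NM service principal is a member of (assumed name). -->
  <value>nodemanager-admins</value>
</property>
{code}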



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071781#comment-14071781
 ] 

Hadoop QA commented on YARN-1063:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657356/YARN-1063.5.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4404//console

This message is automatically generated.

> Winutils needs ability to create task as domain user
> 
>
> Key: YARN-1063
> URL: https://issues.apache.org/jira/browse/YARN-1063
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
> Environment: Windows
>Reporter: Kyle Leckie
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
> YARN-1063.5.patch, YARN-1063.patch
>
>
> h1. Summary:
> Securing a Hadoop cluster requires constructing some form of security 
> boundary around the processes executed in YARN containers. Isolation based on 
> Windows user isolation seems most feasible. This approach is similar to the 
> approach taken by the existing LinuxContainerExecutor. The current patch to 
> winutils.exe adds the ability to create a process as a domain user. 
> h1. Alternative Methods considered:
> h2. Process rights limited by security token restriction:
> On Windows access decisions are made by examining the security token of a 
> process. It is possible to spawn a process with a restricted security token. 
> Any of the rights granted by SIDs of the default token may be restricted. It 
> is possible to see this in action by examining the security token of a 
> sandboxed process launched by a web browser. Typically the launched process 
> will have a fully restricted token and need to access machine resources 
> through a dedicated broker process that enforces a custom security policy. 
> This broker process mechanism would break compatibility with the typical 
> Hadoop container process. The Container process must be able to utilize 
> standard function calls for disk and network IO. I performed some work 
> looking at ways to ACL the local files to the specific process launched without 
> granting rights to other processes launched on the same machine but found 
> this to be an overly complex solution. 
> h2. Relying on APP containers:
> Recent versions of windows have the ability to launch processes within an 
> isolated container. Application containers are supported for execution of 
> WinRT based executables. This method was ruled out due to the lack of 
> official support for standard windows APIs. At some point in the future 
> windows may support functionality similar to BSD jails or Linux containers, 
> at that point support for containers should be added.
> h1. Create As User Feature Description:
> h2. Usage:
> A new sub command was added to the set of task commands. Here is the syntax:
> winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
> Some notes:
> * The username specified is in the format of "user@domain"
> * The machine executing this command must be joined to the domain of the user 
> specified
> * The domain controller must allow the account executing the command access 
> to the user information. For this join the account to the predefined group 
> labeled "Pre-Windows 2000 Compatible Access"
> * The account running the command must have several rights on the local 
> machine. These can be managed manually using secpol.msc: 
> ** "Act as part of the operating system" - SE_TCB_NAME
> ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME
> ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME
> * The launched process will not have rights to the desktop so will not be 
> able to display any information or create UI.
> * The launched process will have no network credentials. Any access of 
> network resources that requires domain authentication will fail.
> h2. Implementation:
> Winutils performs the following steps:
> # Enable the required privileges for the current process.
> # Register as a trusted process with the Local Security Authority (LSA).
> # Create a new logon for the user passed on the command line.
> # Load/Create a profile on the local machine for the new logon.
> # Create a new environment for the new logon.
> # Launch the new process in a job with the task name specified and using the 
> created logon.
> # Wait for the JOB to exit.
> h2. Future work:
> The following work was scoped out of this check in:
> * Support for non-domain users or machines that are not domain joined.
> * Support for privilege isolation by running the task launcher in a high 
> privilege service with access over an ACLed named pipe.

[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-07-23 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---

Attachment: YARN-2198.2.patch

Patch .2 enables mutual auth on LRPC. TODO: separate the config for the service from 
yarn-site.xml and update SecureExecutor.apt.vm to reflect the reality of 
YARN-2198

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or 
> a member of the local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates to the entire NM to run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-23 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2247:


Attachment: apache-yarn-2247.4.patch

{quote}
Varun Vasudev, thanks for your patience on my comments. The new patch looks 
almost good to me. Just some nits:

1. Should not be necessary. Always load TimelineAuthenticationFilter. With 
"simple" type, still the pseudo handler is to used.
{noformat}
+if (authType.equals("simple") && 
!UserGroupInformation.isSecurityEnabled()) {
+  container.addFilter("authentication",
+AuthenticationFilter.class.getName(), filterConfig);
+  return;
+}
{noformat}
{quote}
Good point. Fixed.

{quote}
2. Check not null first for testMiniKDC and rm? Same for 
TestRMWebappAuthentication
{noformat}
+testMiniKDC.stop();
+rm.stop();
{noformat}
{quote}
Fixed.

{quote}
3. I didn't find the logic to forbid it. Anyway, is it good to mention it in 
the document as well?
{noformat}
+  // Test to make sure that we can't do delegation token
+  // functions using just delegation token auth
{noformat}
{quote}
The test is in RMWebServices.
{noformat}
callerUGI = createKerberosUserGroupInformation(hsr);
{noformat}
which in turn has this check 
{noformat}
String authType = hsr.getAuthType();
if (!KerberosAuthenticationHandler.TYPE.equals(authType)) {
  String msg =
  "Delegation token operations can only be carried out on a "
  + "Kerberos authenticated channel";
  throw new YarnException(msg);
}
{noformat}

I've documented it under the delegation token rest API section:
{noformat}
 All delegation token requests must be carried out on a Kerberos authenticated 
connection(using SPNEGO).
{noformat}

> Allow RM web services users to authenticate using delegation tokens
> ---
>
> Key: YARN-2247
> URL: https://issues.apache.org/jira/browse/YARN-2247
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, 
> apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch
>
>
> The RM webapp should allow users to authenticate using delegation tokens to 
> maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart

2014-07-23 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1342:
-

Attachment: YARN-1342v6.patch

Thanks for the review, Junping!

bq. Would you confirm my understanding is correct? If so, the following code 
may not be necessary?

Yes, that's correct.  Sorry, I meant to remove that code to match the same 
behavior from NMContainerTokenSecretManagerInNM and forgot to do so. Thanks for 
catching this, and I updated the patch accordingly.


> Recover container tokens upon nodemanager restart
> -
>
> Key: YARN-1342
> URL: https://issues.apache.org/jira/browse/YARN-1342
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1342.patch, YARN-1342v2.patch, 
> YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, 
> YARN-1342v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071829#comment-14071829
 ] 

Hudson commented on YARN-2319:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2319. Made the MiniKdc instance start/close before/after the class of 
TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java


> Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
> ---
>
> Key: YARN-2319
> URL: https://issues.apache.org/jira/browse/YARN-2319
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: resourcemanager
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Wenwu Peng
>Assignee: Wenwu Peng
> Fix For: 2.5.0
>
> Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, 
> YARN-2319.2.patch
>
>
> MiniKdc only invoke start method not stop in 
> TestRMWebServicesDelegationTokens.java
> {code}
> testMiniKDC.start();
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071816#comment-14071816
 ] 

Hudson commented on YARN-2313:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2313. Livelock can occur in FairScheduler when there are lots of running 
apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the queue. 
> After investigating the code, the following case can occur:
> 1. {{update()}} called by UpdateThread takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071826#comment-14071826
 ] 

Hudson commented on YARN-2131:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2131. Addendum2: Document -format-state-store. Add a way to format the 
RMStateStore. (Robert Kanter via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm


> Add a way to format the RMStateStore
> 
>
> Key: YARN-2131
> URL: https://issues.apache.org/jira/browse/YARN-2131
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>Assignee: Robert Kanter
> Fix For: 2.6.0
>
> Attachments: YARN-2131.patch, YARN-2131.patch, 
> YARN-2131_addendum.patch, YARN-2131_addendum2.patch
>
>
> There are cases when we don't want to recover past applications, but recover 
> applications going forward. To do this, one has to clear the store. Today, 
> there is no easy way to do this and users should understand how each store 
> works.
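
For reference, the option documented by this addendum is invoked as below; this is a usage sketch, typically run while the RM is stopped (see the YarnCommands documentation added by the patch for the authoritative syntax):
{code}
yarn resourcemanager -format-state-store
{code}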



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071818#comment-14071818
 ] 

Hudson commented on YARN-2295:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2295. Refactored DistributedShell to use public APIs of protocol records. 
Contributed by Li Lu (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java


> Refactor YARN distributed shell with existing public stable API
> ---
>
> Key: YARN-2295
> URL: https://issues.apache.org/jira/browse/YARN-2295
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, 
> YARN-2295-071514.patch, YARN-2295-072114.patch
>
>
> Some API calls in YARN distributed shell have been marked as unstable and 
> private. Use existing public stable API to replace them, if possible. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071830#comment-14071830
 ] 

Hudson commented on YARN-2242:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2242. Addendum patch. Improve exception information on AM launch crashes. 
(Contributed by Li Lu) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


> Improve exception information on AM launch crashes
> --
>
> Key: YARN-2242
> URL: https://issues.apache.org/jira/browse/YARN-2242
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, 
> YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, 
> YARN-2242-071414.patch
>
>
> Now on each time AM Container crashes during launch, both the console and the 
> webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, 
> but sometimes confusing. With the help of log aggregator, container logs are 
> actually aggregated, and can be very helpful for debugging. One possible way 
> to improve the whole process is to send a "pointer" to the aggregated logs to 
> the programmer when reporting exception information. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071815#comment-14071815
 ] 

Hudson commented on YARN-2273:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/])
YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


> NPE in ContinuousScheduling thread when we lose a node
> --
>
> Key: YARN-2273
> URL: https://issues.apache.org/jira/browse/YARN-2273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.3.0, 2.4.1
> Environment: cdh5.0.2 wheezy
>Reporter: Andy Skelton
>Assignee: Wei Yan
> Fix For: 2.6.0
>
> Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, 
> YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch
>
>
> One DN experienced memory errors and entered a cycle of rebooting and 
> rejoining the cluster. After the second time the node went away, the RM 
> produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1404858438119_4352_01 released container 
> container_1404858438119_4352_01_04 on node: host: 
> node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 
> available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
> 
> 2014-07-09 21:47:36,571 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>   at java.util.TimSort.sort(TimSort.java:203)
>   at java.util.TimSort.sort(TimSort.java:173)
>   at java.util.Arrays.sort(Arrays.java:659)
>   at java.util.Collections.sort(Collections.java:217)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be 
> submitted but containers were not assigned and no progress was made. 
> Restarting the RM resolved it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-07-23 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071838#comment-14071838
 ] 

Sunil G commented on YARN-2301:
---

bq.we can update the usage block to let users how to use the opts correctly. 
When users make the mistake, they will be redirect the usage output
+1. Yes, the user can be redirected back to the correct usage.





> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
>
> While running yarn container -list  command, some 
> observations:
> 1) the scheme (e.g. http/https  ) before LOG-URL is missing
> 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to 
> print as time format.
> 3) finish-time is 0 if container is not yet finished. May be "N/A"
> 4) May have an option to run as yarn container -list  OR  yarn 
> application -list-containers  also.  
> As attempt Id is not shown on console, this is easier for user to just copy 
> the appId and run it, may  also be useful for container-preserving AM 
> restart. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2340) NPE thrown when RM restart after queue is STOPPED

2014-07-23 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-2340:
---

 Summary: NPE thrown when RM restart after queue is STOPPED
 Key: YARN-2340
 URL: https://issues.apache.org/jira/browse/YARN-2340
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.4.1
 Environment: Capacityscheduler with Queue a, b

Reporter: Nishan Shetty
Priority: Critical


While a job is in progress, make the Queue state STOPPED and then restart the RM.

Observe that the standby RM fails to come up as active, throwing the NPE below:

2014-07-23 18:43:24,432 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED
2014-07-23 18:43:24,433 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
 at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
 at java.lang.Thread.run(Thread.java:662)
2014-07-23 18:43:24,434 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2340) NPE thrown when RM restart after queue is STOPPED

2014-07-23 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty resolved YARN-2340.
-

Resolution: Unresolved

> NPE thrown when RM restart after queue is STOPPED
> -
>
> Key: YARN-2340
> URL: https://issues.apache.org/jira/browse/YARN-2340
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.4.1
> Environment: Capacityscheduler with Queue a, b
>Reporter: Nishan Shetty
>Priority: Critical
>
> While a job is in progress, make the Queue state STOPPED and then restart the RM. 
> Observe that the standby RM fails to come up as active, throwing the NPE below:
> 2014-07-23 18:43:24,432 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED
> 2014-07-23 18:43:24,433 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
>  at java.lang.Thread.run(Thread.java:662)
> 2014-07-23 18:43:24,434 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (YARN-2340) NPE thrown when RM restart after queue is STOPPED

2014-07-23 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty reopened YARN-2340:
-


> NPE thrown when RM restart after queue is STOPPED
> -
>
> Key: YARN-2340
> URL: https://issues.apache.org/jira/browse/YARN-2340
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.4.1
> Environment: Capacityscheduler with Queue a, b
>Reporter: Nishan Shetty
>Priority: Critical
>
> While a job is in progress, make the Queue state STOPPED and then restart the RM. 
> Observe that the standby RM fails to come up as active, throwing the NPE below:
> 2014-07-23 18:43:24,432 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED
> 2014-07-23 18:43:24,433 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
>  at java.lang.Thread.run(Thread.java:662)
> 2014-07-23 18:43:24,434 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071853#comment-14071853
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

Thanks for the comments, Jian, Zhijie and Sid.

{quote}
For example, ContainerTokenIdentifier serializes a long (getContainerId()) at 
RM side, but deserializes a int (getId()) at NM side. In this case, I'm afraid 
it's going to be wrong
{quote}

If we treat backward compatibility as the first priority, we can choose the 
first design I proposed, as Sid mentioned. This design choice looks reasonable 
to me. [~jianhe], what do you think? We discussed that we should avoid 
introducing a new field to the ContainerId class; in my opinion, that reason is 
weaker than backward compatibility.

{quote}
ConverterUtils is a separate consideration. It is marked as @private - but is 
used in MapReduce for example (and also in Tez). Looks like the toString method 
isn't being changed either, whcih means to ConverterUtils method would continue 
to work.
{quote}

I'm thinking of suffixing the epoch at the end of the container id. It will work 
with an old jar which includes the old {{ConverterUtils#toContainerId}}. YARN-2182 
is the JIRA to address the change of {{ConverterUtils#toContainerId}}.



> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after 
> the RM restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.
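
To make the overflow concern concrete, a small sketch of the 10-bit-epoch / 22-bit-sequence packing described above; this is illustrative arithmetic, not the actual ContainerId code:
{code}
public class ContainerIdPackingSketch {
  // Per the YARN-2052 format: upper 10 bits hold the RM epoch,
  // lower 22 bits hold the container sequence number.
  static int pack(int epoch, int sequence) {
    return (epoch << 22) | (sequence & ((1 << 22) - 1));
  }

  public static void main(String[] args) {
    int maxEpoch = (1 << 10) - 1;  // 1023; one more RM restart wraps the epoch
    int id = pack(maxEpoch, 1);
    System.out.println("epoch=" + (id >>> 22) + " sequence=" + (id & 0x3FFFFF));
    // After 1024 restarts the epoch no longer fits in 10 bits, which is why the
    // JIRA proposes widening the container id to a long.
  }
}
{code}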



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-23 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071859#comment-14071859
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

{quote}
I'm not sure it's good to makr a @Stable method back to @Unstable
Agree with Zhijie on not changing an @Stable method to @Unstable. Deprecate in 
this patch itself ?
{quote}

The @Stable or @Unstable discussion will disappear if we decide to continue to 
use {{getId}}. I'd like to decide that before continuing the discussion.

{quote}
hashCode and equals are inconsistent in the latest patch. One uses getId(), the 
other uses getContainerId
{quote}

This is my mistake; I'll update it in the next patch. Or, if we decide to 
continue to use {{getId}}, I'll revert it.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after 
> the RM restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071865#comment-14071865
 ] 

Hadoop QA commented on YARN-1342:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657361/YARN-1342v6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4406//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4406//console

This message is automatically generated.

> Recover container tokens upon nodemanager restart
> -
>
> Key: YARN-1342
> URL: https://issues.apache.org/jira/browse/YARN-1342
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1342.patch, YARN-1342v2.patch, 
> YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, 
> YARN-1342v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071878#comment-14071878
 ] 

Hadoop QA commented on YARN-2247:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657359/apache-yarn-2247.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4405//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4405//console

This message is automatically generated.

> Allow RM web services users to authenticate using delegation tokens
> ---
>
> Key: YARN-2247
> URL: https://issues.apache.org/jira/browse/YARN-2247
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, 
> apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch
>
>
> The RM webapp should allow users to authenticate using delegation tokens to 
> maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart

2014-07-23 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2262:


Attachment: yarn-testos-resourcemanager-HOST-10-18-40-84.log
yarn-testos-historyserver-HOST-10-18-40-95.log
Capture1.PNG
Capture.PNG
yarn-testos-resourcemanager-HOST-10-18-40-95.log

> Few fields displaying wrong values in Timeline server after RM restart
> --
>
> Key: YARN-2262
> URL: https://issues.apache.org/jira/browse/YARN-2262
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.4.0
>Reporter: Nishan Shetty
>Assignee: Naganarasimha G R
> Attachments: Capture.PNG, Capture1.PNG, 
> yarn-testos-historyserver-HOST-10-18-40-95.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-84.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-95.log
>
>
> Few fields displaying wrong values in Timeline server after RM restart
> State:null
> FinalStatus:  UNDEFINED
> Started:  8-Jul-2014 14:58:08
> Elapsed:  2562047397789hrs, 44mins, 47sec 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart

2014-07-23 Thread Nishan Shetty (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071944#comment-14071944
 ] 

Nishan Shetty commented on YARN-2262:
-

[~zjshen] Attached logs
Application id is application_1406114813957_0002

> Few fields displaying wrong values in Timeline server after RM restart
> --
>
> Key: YARN-2262
> URL: https://issues.apache.org/jira/browse/YARN-2262
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.4.0
>Reporter: Nishan Shetty
>Assignee: Naganarasimha G R
> Attachments: Capture.PNG, Capture1.PNG, 
> yarn-testos-historyserver-HOST-10-18-40-95.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-84.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-95.log
>
>
> Few fields displaying wrong values in Timeline server after RM restart
> State:null
> FinalStatus:  UNDEFINED
> Started:  8-Jul-2014 14:58:08
> Elapsed:  2562047397789hrs, 44mins, 47sec 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-07-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071985#comment-14071985
 ] 

Naganarasimha G R commented on YARN-2301:
-

Thanks [~zjshen],[~sunilg],[~devaraj.k] & [~jianhe] for the comments,
I will start modifying as per [~zjshen]'s approach and try to provide the patch 
at the earliest.


> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
>
> While running yarn container -list  command, some 
> observations:
> 1) the scheme (e.g. http/https  ) before LOG-URL is missing
> 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to 
> print as time format.
> 3) finish-time is 0 if container is not yet finished. May be "N/A"
> 4) May have an option to run as yarn container -list  OR  yarn 
> application -list-containers  also.  
> As attempt Id is not shown on console, this is easier for user to just copy 
> the appId and run it, may  also be useful for container-preserving AM 
> restart. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071999#comment-14071999
 ] 

zhihai xu commented on YARN-2337:
-

[~ozawa] thanks for your quick response.

> remove duplication function call (setClientRMService) in resource manage class
> --
>
> Key: YARN-2337
> URL: https://issues.apache.org/jira/browse/YARN-2337
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-2337.000.patch
>
>
> remove duplication function call (setClientRMService) in resource manage 
> class.
> rmContext.setClientRMService(clientRM); is duplicate in serviceInit of 
> ResourceManager. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps

2014-07-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072042#comment-14072042
 ] 

Karthik Kambatla commented on YARN-2313:


Thanks for the explanation, [~ozawa]. I see the issue clearly now. 

In that case, a better approach might be to have a single "maintenance" thread 
that periodically executes a bunch of runnables (reload, update, 
continuous-scheduling) serially. Otherwise, as we add more threads that hold 
onto the scheduler lock, it will be hairy to tune all of them so the scheduler 
can make some meaningful progress. 
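
For illustration, a minimal sketch of what such a single maintenance thread could 
look like; the class name and wiring are hypothetical and not taken from the 
FairScheduler code:

{code}
// Hypothetical sketch: one maintenance thread running several tasks serially,
// so only one of them holds the scheduler lock at a time.
import java.util.List;

public class SchedulerMaintenanceThread extends Thread {
  private final List<Runnable> tasks;      // e.g. reload, update, continuous scheduling
  private final long intervalMs;
  private volatile boolean running = true;

  public SchedulerMaintenanceThread(List<Runnable> tasks, long intervalMs) {
    super("SchedulerMaintenance");
    setDaemon(true);
    this.tasks = tasks;
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    while (running && !Thread.currentThread().isInterrupted()) {
      for (Runnable task : tasks) {
        task.run();                        // tasks execute one after another
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop on interrupt
      }
    }
  }

  public void shutdown() {
    running = false;
    interrupt();
  }
}
{code}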

> Livelock can occur in FairScheduler when there are lots of running apps
> ---
>
> Key: YARN-2313
> URL: https://issues.apache.org/jira/browse/YARN-2313
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Fix For: 2.6.0
>
> Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, 
> YARN-2313.4.patch, rm-stack-trace.txt
>
>
> Observed a livelock in FairScheduler when there are lots of entries in the 
> queue. After investigating the code, the following case can occur:
> 1. {{update()}} called by UpdateThread takes longer than 
> UPDATE_INTERVAL (500ms) if there are lots of queues.
> 2. UpdateThread goes into a busy loop.
> 3. Other threads (AllocationFileReloader, 
> ResourceManager$SchedulerEventDispatcher) can wait forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-23 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2212:


Attachment: YARN-2212.2.patch

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-23 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072086#comment-14072086
 ] 

Xuan Gong commented on YARN-2212:
-

Merged YARN-2237 into this patch.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (YARN-2341) Refactor TestCapacityScheduler to separate tests per feature

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer moved MAPREDUCE-723 to YARN-2341:
--

Component/s: (was: capacity-sched)
 capacityscheduler
 Issue Type: Test  (was: Bug)
Key: YARN-2341  (was: MAPREDUCE-723)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> Refactor TestCapacityScheduler to separate tests per feature
> 
>
> Key: YARN-2341
> URL: https://issues.apache.org/jira/browse/YARN-2341
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>
> TestCapacityScheduler has grown rapidly over time. It now has tests for 
> various features interspersed amongst each other. It would be helpful to 
> separate out tests per feature, moving out the central mock objects to a 
> primary test class.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2342:
---

Labels: newbie  (was: )

> When killing a task, we don't always need to send a subsequent SIGKILL
> --
>
> Key: YARN-2342
> URL: https://issues.apache.org/jira/browse/YARN-2342
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Vinod Kumar Vavilapalli
>  Labels: newbie
>
> In both TaskController/LinuxTaskController, while killing tasks, first a 
> SIGTERM and then a subsequent SIGKILL are sent. We don't always need to send 
> the SIGKILL. It can be avoided when the SIGTERM command (kill pid for a 
> process or kill -- -pid for a session) returns a non-zero exit code, i.e. when 
> the signal is not sent successfully because the process/process group doesn't 
> exist. 'man 2 kill' says the exit code is non-zero only when the 
> process/process group is not alive, an invalid signal is specified, or the 
> process doesn't have permissions. The last two don't happen in mapred code.
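
A minimal sketch of the idea in the description, assuming a helper that sends a 
signal via the system kill command and reports whether it was delivered (the 
class and method names are hypothetical):

{code}
// Hypothetical sketch: only follow up with SIGKILL if SIGTERM was delivered,
// i.e. the kill command returned exit code 0 (process/group still existed).
public class GracefulKill {
  static boolean signal(String pid, String sig) throws Exception {
    // kill -SIG pid; exit code 0 means the signal was delivered
    Process p = new ProcessBuilder("kill", "-" + sig, pid).start();
    return p.waitFor() == 0;
  }

  public static void kill(String pid, long gracePeriodMs) throws Exception {
    if (!signal(pid, "TERM")) {
      return;                       // process already gone, no SIGKILL needed
    }
    Thread.sleep(gracePeriodMs);    // give the process time to exit cleanly
    signal(pid, "KILL");            // best effort; no-op if it already exited
  }
}
{code}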



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL

2014-07-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072249#comment-14072249
 ] 

Allen Wittenauer commented on YARN-2342:


Moving this to YARN, as we need to check container executor.

> When killing a task, we don't always need to send a subsequent SIGKILL
> --
>
> Key: YARN-2342
> URL: https://issues.apache.org/jira/browse/YARN-2342
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Vinod Kumar Vavilapalli
>  Labels: newbie
>
> In both TaskController/LinuxTaskController, while killing tasks, first a 
> SIGTERM and then a subsequent SIGKILL. We don't need to send the SIGKILL 
> always. It can be avoided when the SIGTERM command (kill pid for process or 
> kill -- -pid for session) returns a non-zero exit code, i.e. when the signal 
> is not sent successfully because process/process group doesn't exist. 'man 2 
> kill' says exit code is non-zero only when process/process group is not alive 
> or invalid signal is specified or the process doesn't have permissions. The 
> last two don't happen in mapred code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer moved MAPREDUCE-780 to YARN-2342:
--

Issue Type: Improvement  (was: Bug)
   Key: YARN-2342  (was: MAPREDUCE-780)
   Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> When killing a task, we don't always need to send a subsequent SIGKILL
> --
>
> Key: YARN-2342
> URL: https://issues.apache.org/jira/browse/YARN-2342
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Vinod Kumar Vavilapalli
>  Labels: newbie
>
> In both TaskController/LinuxTaskController, while killing tasks, first a 
> SIGTERM and then a subsequent SIGKILL. We don't need to send the SIGKILL 
> always. It can be avoided when the SIGTERM command (kill pid for process or 
> kill -- -pid for session) returns a non-zero exit code, i.e. when the signal 
> is not sent successfully because process/process group doesn't exist. 'man 2 
> kill' says exit code is non-zero only when process/process group is not alive 
> or invalid signal is specified or the process doesn't have permissions. The 
> last two don't happen in mapred code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2343) Improve

2014-07-23 Thread Li Lu (JIRA)
Li Lu created YARN-2343:
---

 Summary: Improve
 Key: YARN-2343
 URL: https://issues.apache.org/jira/browse/YARN-2343
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Li Lu
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Moved] (YARN-2344) Provide a mechanism to pause the jobtracker

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer moved MAPREDUCE-828 to YARN-2344:
--

Component/s: (was: jobtracker)
 resourcemanager
Key: YARN-2344  (was: MAPREDUCE-828)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> Provide a mechanism to pause the jobtracker
> ---
>
> Key: YARN-2344
> URL: https://issues.apache.org/jira/browse/YARN-2344
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Hemanth Yamijala
>
> We've seen scenarios when we have needed to stop the namenode for a 
> maintenance activity. In such scenarios, if the jobtracker (JT) continues to 
> run, jobs would fail due to initialization or task failures (due to DFS). We 
> could restart the JT enabling job recovery, during such scenarios. But 
> restart has proved to be a very intrusive activity, particularly if the JT is 
> not at fault itself and does not require a restart. The ask is for a 
> admin-controlled feature to pause the JT which would take it to a state 
> somewhat analogous to the safe mode of DFS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2344) Provide a mechanism to pause the jobtracker

2014-07-23 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072323#comment-14072323
 ] 

Allen Wittenauer commented on YARN-2344:


Moving this to YARN.

We still need a way to pause the Resource Manager from accepting new 
submissions.

> Provide a mechanism to pause the jobtracker
> ---
>
> Key: YARN-2344
> URL: https://issues.apache.org/jira/browse/YARN-2344
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Reporter: Hemanth Yamijala
>
> We've seen scenarios when we have needed to stop the namenode for a 
> maintenance activity. In such scenarios, if the jobtracker (JT) continues to 
> run, jobs would fail due to initialization or task failures (due to DFS). We 
> could restart the JT enabling job recovery, during such scenarios. But 
> restart has proved to be a very intrusive activity, particularly if the JT is 
> not at fault itself and does not require a restart. The ask is for a 
> admin-controlled feature to pause the JT which would take it to a state 
> somewhat analogous to the safe mode of DFS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2343) Improve error message on token expire exception

2014-07-23 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2343:


Description: Some token expire exceptions are triggered by wrong time 
settings on cluster nodes, but the current exception message does not 
explicitly address that. It would be helpful to add a message explicitly 
pointing out that this exception could be caused by machines being out of sync 
in time, or even by wrong time zone settings. 
   Assignee: Li Lu
 Labels: usability  (was: )
Summary: Improve error message on token expire exception  (was: Improve)

> Improve error message on token expire exception
> ---
>
> Key: YARN-2343
> URL: https://issues.apache.org/jira/browse/YARN-2343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Li Lu
>Assignee: Li Lu
>Priority: Trivial
>  Labels: usability
>
> Some token expire exceptions are triggered by wrong time settings on cluster 
> nodes, but the current exception message does not explicitly address that. It 
> would be helpful to add a message explicitly pointing out that this exception 
> could be caused by machines being out of sync in time, or even by wrong time 
> zone settings. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2343) Improve error message on token expire exception

2014-07-23 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2343:


Attachment: YARN-2343-072314.patch

Adding a message to point out that the token expire exception could be caused 
by machines being out of sync, or by wrong timezone settings. 
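
For illustration, a rough sketch of the kind of hint such a message could carry; 
the wording and the helper below are hypothetical, not necessarily what the 
attached patch does:

{code}
// Hypothetical sketch: enrich a token-expired error with a hint about clock skew.
public class TokenExpiryMessage {
  public static String buildHint(String tokenKind, long nowMs, long expiryMs) {
    return tokenKind + " token has expired (now=" + nowMs + ", expiry=" + expiryMs
        + "). If the token should still be valid, check that the cluster nodes' "
        + "clocks are in sync (e.g. via NTP) and that the time zone settings are "
        + "correct.";
  }
}
{code}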

> Improve error message on token expire exception
> ---
>
> Key: YARN-2343
> URL: https://issues.apache.org/jira/browse/YARN-2343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Li Lu
>Assignee: Li Lu
>Priority: Trivial
>  Labels: usability
> Attachments: YARN-2343-072314.patch
>
>
> Some token expire exceptions are triggered by wrong time settings on cluster 
> nodes, but the current exception message does not explicitly address that. It 
> would be helpful to add a message explicitly pointing out that this exception 
> could be caused by machines being out of sync in time, or even by wrong time 
> zone settings. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2345) yarn rmadin -report

2014-07-23 Thread Allen Wittenauer (JIRA)
Allen Wittenauer created YARN-2345:
--

 Summary: yarn rmadin -report
 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Allen Wittenauer


It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-23 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072356#comment-14072356
 ] 

Siddharth Seth commented on YARN-2229:
--

[~ozawa] - I was primarily looking at this from a  backward compatibility 
perspective. Will leave the decision to go with the current approach or adding 
a hidden field to you, Jian and Zhijie.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch, the lower 22 bits are for the sequence number of the Ids. This is for 
> preserving the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility in this 
> JIRA.
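
For reference, a small sketch of the 10-bit/22-bit packing described above 
(illustrative only, not the actual ContainerId code):

{code}
// Hypothetical sketch of the YARN-2052 layout: upper 10 bits = epoch,
// lower 22 bits = container sequence number, packed into a 32-bit int.
public class ContainerIdBits {
  static final int SEQ_BITS = 22;
  static final int SEQ_MASK = (1 << SEQ_BITS) - 1;

  static int pack(int epoch, int sequence) {
    return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
  }

  static int epochOf(int id)    { return id >>> SEQ_BITS; }
  static int sequenceOf(int id) { return id & SEQ_MASK; }

  public static void main(String[] args) {
    int id = pack(3, 42);
    System.out.println(epochOf(id) + " " + sequenceOf(id));  // prints: 3 42
    // The epoch only has 10 bits, so it wraps after 1024 RM restarts --
    // the overflow concern that motivates moving to a long.
  }
}
{code}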



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2345) yarn rmadmin -report

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2345:
---

Summary: yarn rmadmin -report  (was: yarn rmadin -report)

> yarn rmadmin -report
> 
>
> Key: YARN-2345
> URL: https://issues.apache.org/jira/browse/YARN-2345
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Allen Wittenauer
>
> It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2345) yarn rmadmin -report

2014-07-23 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2345:
---

Labels: newbie  (was: )

> yarn rmadmin -report
> 
>
> Key: YARN-2345
> URL: https://issues.apache.org/jira/browse/YARN-2345
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Allen Wittenauer
>  Labels: newbie
>
> It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2346) Add a 'status' command to yarn-daemon.sh

2014-07-23 Thread Nikunj Bansal (JIRA)
Nikunj Bansal created YARN-2346:
---

 Summary: Add a 'status' command to yarn-daemon.sh
 Key: YARN-2346
 URL: https://issues.apache.org/jira/browse/YARN-2346
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.4.1, 2.4.0, 2.3.0, 2.2.0
Reporter: Nikunj Bansal
Priority: Minor


Adding a 'status' command to yarn-daemon.sh will be useful for finding out the 
status of yarn daemons.

Running the 'status' command should exit with a 0 exit code if the target 
daemon is running and a non-zero code in case it's not.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2346) Add a 'status' command to yarn-daemon.sh

2014-07-23 Thread Nikunj Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikunj Bansal updated YARN-2346:


Affects Version/s: 2.2.1

> Add a 'status' command to yarn-daemon.sh
> 
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>Reporter: Nikunj Bansal
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh will be useful for finding out 
> the status of yarn daemons.
> Running the 'status' command should exit with a 0 exit code if the target 
> daemon is running and a non-zero code in case it's not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails

2014-07-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072390#comment-14072390
 ] 

Jason Lowe commented on YARN-2147:
--

The test timeouts were an artifact of a period where we were requiring each 
test to have a timeout to work around a surefire timeout bug, but we no longer 
need each test to have one.  It's not going to hurt if present even for tests 
that shouldn't need them as long as the timeout is reasonable for the test.

+1 lgtm.  Committing this.

> client lacks delegation token exception details when application submit fails
> -
>
> Key: YARN-2147
> URL: https://issues.apache.org/jira/browse/YARN-2147
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Jason Lowe
>Assignee: Chen He
>Priority: Minor
> Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, 
> YARN-2147-v4.patch, YARN-2147-v5.patch, YARN-2147.patch
>
>
> When an client submits an application and the delegation token process fails 
> the client can lack critical details needed to understand the nature of the 
> error.  Only the message of the error exception is conveyed to the client, 
> which sometimes isn't enough to debug.
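
As an aside, a tiny sketch of the kind of detail that is lost when only 
{{getMessage()}} is propagated; this is illustrative only, not the committed 
change to DelegationTokenRenewer:

{code}
// Hypothetical sketch: capture the full cause chain instead of only e.getMessage(),
// so the client can see why the delegation token step failed.
import java.io.PrintWriter;
import java.io.StringWriter;

public class DiagnosticUtil {
  public static String fullStackTrace(Throwable t) {
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return sw.toString();          // message plus every nested cause
  }
}
{code}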



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2343) Improve error message on token expire exception

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072394#comment-14072394
 ] 

Hadoop QA commented on YARN-2343:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657437/YARN-2343-072314.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4407//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4407//console

This message is automatically generated.

> Improve error message on token expire exception
> ---
>
> Key: YARN-2343
> URL: https://issues.apache.org/jira/browse/YARN-2343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Li Lu
>Assignee: Li Lu
>Priority: Trivial
>  Labels: usability
> Attachments: YARN-2343-072314.patch
>
>
> Some token expire exceptions are triggered by wrong time settings on cluster 
> nodes, but the current exception message does not explicitly address that. It 
> would be helpful to add a message explicitly pointing out that this exception 
> could be caused by machines being out of sync in time, or even by wrong time 
> zone settings. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2214) preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-23 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072409#comment-14072409
 ] 

Ashwin Shankar commented on YARN-2214:
--

[~kasha], [~sandyr] Can one of you please look at this one? 
Thanks in advance!


> preemptContainerPreCheck() in FSParentQueue delays convergence towards 
> fairness
> ---
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Attachments: YARN-2214-v1.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below fair share. This can cause a delay in converging 
> towards fairness when the starved leaf queue and the queue above fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80%, usage = 80%
> root.parent.child1 has fair share = 40%, usage = 80%
> root.parent.child2 has fair share = 40%, usage = 0%
> Now a job is submitted to child2 and the demand is 40%.
> Preemption will kick in and try to reclaim all the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> will become <80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round 
> although the need is a lot more. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> Solution is to remove preemptContainerPreCheck() in FSParentQueue and keep it 
> only in FSLeafQueue (which is already there).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072426#comment-14072426
 ] 

Hudson commented on YARN-2147:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5956 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5956/])
YARN-2147. client lacks delegation token exception details when application 
submit fails. Contributed by Chen He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612950)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java


> client lacks delegation token exception details when application submit fails
> -
>
> Key: YARN-2147
> URL: https://issues.apache.org/jira/browse/YARN-2147
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Jason Lowe
>Assignee: Chen He
>Priority: Minor
> Fix For: 3.0.0, 2.6.0
>
> Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, 
> YARN-2147-v4.patch, YARN-2147-v5.patch, YARN-2147.patch
>
>
> When an client submits an application and the delegation token process fails 
> the client can lack critical details needed to understand the nature of the 
> error.  Only the message of the error exception is conveyed to the client, 
> which sometimes isn't enough to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-07-23 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072507#comment-14072507
 ] 

Li Lu commented on YARN-2314:
-

Yes, that makes sense. And I do agree that this is a quick fix to the problem. 

> ContainerManagementProtocolProxy can create thousands of threads for a large 
> cluster
> 
>
> Key: YARN-2314
> URL: https://issues.apache.org/jira/browse/YARN-2314
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Priority: Critical
> Attachments: nmproxycachefix.prototype.patch
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
> this cache is configurable.  However the cache can grow far beyond the 
> configured size when running on a large cluster and blow AM address/container 
> limits.  More details in the first comment.
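
For illustration only (not the attached prototype patch), one common way to keep 
such a cache bounded is an access-ordered LRU map with an eviction hook; the 
class name below is hypothetical:

{code}
// Hypothetical sketch: a size-bounded LRU cache. Entries evicted beyond maxSize
// would have their proxies closed so thread counts cannot grow without limit.
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedProxyCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxSize;

  public BoundedProxyCache(int maxSize) {
    super(16, 0.75f, true);        // access-order gives LRU behavior
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxSize;       // caller should also close the evicted proxy
  }
}
{code}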



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-23 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:


Attachment: YARN-415.201407232237.txt

[~leftnoteasy], Thank you for your reply.

I have implemented the following changes with the current patch.

{quote}
1. Revert changes of SchedulerAppReport, we already have changed 
ApplicationResourceUsageReport, and memory utilization should be a part of 
resource usage report.
{quote}
Changes to SchedulerAppReport have been reverted.

{quote}
2. Remove getMemory(VCore)Seconds from RMAppAttempt, modify 
RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running 
resource utilization.
{quote}
I have removed getters and setters from RMAppAttempt and added 
RMAppAttemptMetrics#getResourceUtilization, which returns a single 
ResourceUtilization instance that contains both memorySeconds and vcoreSeconds 
for the appAttempt. These include both finished and running statistics IF the 
appAttempt is ALSO the current attempt. If not, it only includes the finished 
statistics.

{quote}
3. put
{code}
 ._("Resources:",
String.format("%d MB-seconds, %d vcore-seconds", 
app.getMemorySeconds(), app.getVcoreSeconds()))
{code}
from "Application Overview" to "Application Metrics", and rename it to 
"Resource Seconds". It should be considered as a part of application metrics 
instead of overview.
{quote}
Changes completed.

{quote}
4. Change finishedMemory/VCoreSeconds to AtomicLong in RMAppAttemptMetrics so 
that they can be efficiently accessed by multiple threads.
{quote}
Changes completed.

{quote}
5. I think it's better to add a new method in SchedulerApplicationAttempt like 
getMemoryUtilization, which will only return memory/cpu seconds. We do this to 
prevent locking scheduling thread when showing application metrics on web UI.
 getMemoryUtilization will be used by 
RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running 
resource utilization. And used by 
SchedulerApplicationAttempt#getResourceUsageReport as well.

The MemoryUtilization class may contain two fields: 
runningContainerMemory(VCore)Seconds.
{quote}
Added ResourceUtilization (instead of MemoryUtilization), but did not make the 
other changes as per comment:
https://issues.apache.org/jira/browse/YARN-415?focusedCommentId=14071181&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14071181

{quote}
6. Since computing the running container resource utilization is not O(1), we need 
to scan all containers under an application. I think it's better to cache a 
previously computed result and recompute it after several seconds (maybe 
1-3 seconds should be enough) have elapsed.
{quote}
I added cached values in SchedulerApplicationAttempt for memorySeconds and 
vcoreSeconds that are updated when 1) a request is received to calculate these 
metrics, AND 2) it has been more than 3 seconds since the last request.


One thing I did notice when these values are cached is that there is a race 
where containers can get counted twice:
- RMAppAttemptMetrics#getResourceUtilization sends a request to calculate 
running containers, and container X is almost finished. 
RMAppAttemptMetrics#getResourceUtilization adds the finished values to the 
running values and returns ResourceUtilization.
- Container X completes and its memorySeconds and vcoreSeconds are added to the 
finished values for appAttempt.
- RMAppAttemptMetrics#getResourceUtilization makes another request before the 3 
second interval, and the cached values are added to the finished values for 
appAttempt.
Since both the cached values and the finished values contain metrics for 
Container X, those are double counted until 3 seconds elapses and the next 
RMAppAttemptMetrics#getResourceUtilization request is made.

{quote}
And you can modify SchedulerApplicationAttempt#liveContainers to be a 
ConcurrentHashMap. With #6, getting the memory utilization to show metrics on the 
web UI will not lock the scheduling thread at all.
{quote}
I am a little reluctant to modify the type of 
SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems 
like something that could be done separately.


> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201

[jira] [Commented] (YARN-2338) service assemble so complex

2014-07-23 Thread tangjunjie (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072638#comment-14072638
 ] 

tangjunjie commented on YARN-2338:
--

Hello, Tsuyoshi OZAWA,
I think service assembly should be moved out of the ResourceManager, because the 
main task of the ResourceManager is to allocate resources and so on. Consider 
using a lightweight DI framework like Guice for the refactoring; the 
ResourceManager code would then get rid of this code smell. XML or annotations 
could be used to declare the service assembly, for example along the lines of 
the sketch below.


 



I think test code will also benefit from this refactoring, because we could 
easily mock a service and inject it for tests.
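
A minimal sketch of what a Guice-style assembly could look like; everything other 
than Guice's own API (the service types and their bindings) is a hypothetical 
placeholder, not real ResourceManager code:

{code}
// Hypothetical sketch: declaring RM services as Guice bindings instead of wiring
// them by hand in serviceInit(). The service types below are placeholders.
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;

public class ResourceManagerModule extends AbstractModule {
  interface SchedulerService { String name(); }
  static class FairSchedulerService implements SchedulerService {
    public String name() { return "fair"; }
  }

  @Override
  protected void configure() {
    // each bind() replaces one manual wiring step; a test could install an
    // overriding module that binds a mock SchedulerService instead
    bind(SchedulerService.class).to(FairSchedulerService.class);
  }

  public static void main(String[] args) {
    Injector injector = Guice.createInjector(new ResourceManagerModule());
    SchedulerService scheduler = injector.getInstance(SchedulerService.class);
    System.out.println(scheduler.name());
  }
}
{code}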



> service assemble so complex
> ---
>
> Key: YARN-2338
> URL: https://issues.apache.org/jira/browse/YARN-2338
> Project: Hadoop YARN
>  Issue Type: Wish
>Reporter: tangjunjie
>
>   See ResourceManager
> protected void serviceInit(Configuration configuration) throws Exception 
> So many services are assembled into the ResourceManager.
> Use Guice or another service-assembly framework to refactor this complex code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072634#comment-14072634
 ] 

Hadoop QA commented on YARN-415:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657484/YARN-415.201407232237.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA
  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4408//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4408//console

This message is automatically generated.

> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2338) service assemble so complex

2014-07-23 Thread dingjiaqi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072661#comment-14072661
 ] 

dingjiaqi commented on YARN-2338:
-

Hi, Tsuyoshi OZAWA. I agree with tangjunjie. Do you plan to refactor it?

> service assemble so complex
> ---
>
> Key: YARN-2338
> URL: https://issues.apache.org/jira/browse/YARN-2338
> Project: Hadoop YARN
>  Issue Type: Wish
>Reporter: tangjunjie
>
>   See ResourceManager
> protected void serviceInit(Configuration configuration) throws Exception 
> So many services are assembled into the ResourceManager.
> Use Guice or another service-assembly framework to refactor this complex code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072696#comment-14072696
 ] 

Wangda Tan commented on YARN-415:
-

Hi Eric,
Thanks for updating your patch. I don't have any major comments now.

*Following are some minor comments:*
1) RMAppAttemptImpl.java
1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you please 
revert them? Like
{code}
   RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition())
-  
+
{code}

1.2 getResourceUtilization:
{code}
+if (rmApps != null) {
+  RMApp app = rmApps.get(attemptId.getApplicationId());
+  if (app != null) {
{code}
I think these two cannot be null here, so we don't need the null checks to guard 
against a potential bug

{code}
+  ApplicationResourceUsageReport appResUsageRpt =
{code}
It's better to name it appResUsageReport, since rpt is not a common abbreviation 
of report.

2) RMContainerImpl.java
2.1 updateAttemptMetrics:
{code}
  if (rmApps != null) {
RMApp rmApp = 
rmApps.get(container.getApplicationAttemptId().getApplicationId());
if (rmApp != null) {
{code}
Again, I think the two null checks are unnecessary

3) SchedulerApplicationAttempt.java
3.1 Some rename suggestions (please let me know if you have a better idea):
CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS
lastTime -> lastMemoryUtilizationUpdateTime
cachedMemorySeconds -> lastMemorySeconds
same for cachedVCore ...

4) AppBlock.java
Should we rename "Resource Seconds:" to "Resource Utilization" or something?

5) Test
5.1 I'm wondering if we need to add an end-to-end test, since we changed 
RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt.
It could consist of submitting an application, launching several containers, and 
finishing the application. And it's better to make the launched application 
contain several application attempts.
While the application is running, there are multiple containers running and 
multiple containers finished. We can check whether the total resource 
utilization is as expected.

*To your comments:*
1) 
bq. One thing I did notice when these values are cached is that there is a race 
where containers can get counted twice:
I think this cannot be avoided; it should be a transient state, and Jian He and I 
discussed this a long time ago.
But apparently the 3-second cache makes it more than a transient state. I suggest 
making "lastTime" in SchedulerApplicationAttempt protected, and in 
FiCaSchedulerApp/FSSchedulerApp, when removing a container from liveContainers 
(in the completedContainer method), setting lastTime to a negative value like -1. 
The next time the accumulated resource utilization is requested, it will then 
recompute the utilization of all containers.

2)
bq. I am a little reluctant to modify the type of 
SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems 
like something that could be done separately.
I think that will be fine :), because the current getRunningResourceUtilization is 
called by getResourceUsageReport, and getResourceUsageReport is synchronized; no 
matter whether we change liveContainers to a concurrent map or not, we cannot 
solve the locking problem. 
I agree to enhance it in a separate JIRA in the future.

Thanks,
Wangda
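
A rough sketch of the cache-and-invalidate idea from 1) above, with illustrative 
names rather than the actual SchedulerApplicationAttempt fields:

{code}
// Hypothetical sketch: cache aggregate memory/vcore-seconds for a few seconds and
// invalidate the cache when a container completes, so finished and running
// containers are not double counted for longer than one refresh.
public class CachedUtilization {
  private static final long CACHE_MILLIS = 3000;

  private long lastUpdateTime = -1;      // -1 forces a recompute on the next read
  private long cachedMemorySeconds;
  private long cachedVcoreSeconds;

  public synchronized long[] get(long now, Recomputer recomputer) {
    if (lastUpdateTime < 0 || now - lastUpdateTime > CACHE_MILLIS) {
      long[] fresh = recomputer.computeRunningSeconds();   // walks live containers
      cachedMemorySeconds = fresh[0];
      cachedVcoreSeconds = fresh[1];
      lastUpdateTime = now;
    }
    return new long[] { cachedMemorySeconds, cachedVcoreSeconds };
  }

  public synchronized void onContainerCompleted() {
    lastUpdateTime = -1;                 // completed container moved to finished totals
  }

  public interface Recomputer {
    long[] computeRunningSeconds();
  }
}
{code}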


> Capture memory utilization at the app-level for chargeback
> --
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn

[jira] [Commented] (YARN-2277) Add Cross-Origin support to the ATS REST API

2014-07-23 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072704#comment-14072704
 ] 

Jonathan Eagles commented on YARN-2277:
---

[~zjshen], [~vinodkv], do you have any comments or concerns with the approach 
above? I would like to get some feedback soon since TEZ-8 is basing work off of 
the CORS patch above.

> Add Cross-Origin support to the ATS REST API
> 
>
> Key: YARN-2277
> URL: https://issues.apache.org/jira/browse/YARN-2277
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch
>
>
> As the Application Timeline Server is not provided with built-in UI, it may 
> make sense to enable JSONP or CORS Rest API capabilities to allow for remote 
> UI to access the data directly via javascript without cross side server 
> browser blocks coming into play.
> Example client may be like
> http://api.jquery.com/jQuery.getJSON/ 
> This can alleviate the need to create a local proxy cache.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart

2014-07-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072707#comment-14072707
 ] 

Junping Du commented on YARN-1342:
--

Thanks for updating the patch, [~jlowe]! Patch looks good to me. 
Hey [~devaraj.k], if you don't have additional comments, I will commit it 
tomorrow.

> Recover container tokens upon nodemanager restart
> -
>
> Key: YARN-1342
> URL: https://issues.apache.org/jira/browse/YARN-1342
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1342.patch, YARN-1342v2.patch, 
> YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, 
> YARN-1342v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-23 Thread Junping Du (JIRA)
Junping Du created YARN-2347:


 Summary: Consolidate RMStateVersion and NMDBSchemaVersion into 
StateVersion in yarn-server-common
 Key: YARN-2347
 URL: https://issues.apache.org/jira/browse/YARN-2347
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du


We have similar things for version state for RM, NM, TS (TimelineServer), etc. 
I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2300) Document better sample requests for RM web services for submitting apps

2014-07-23 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072722#comment-14072722
 ] 

Zhijie Shen commented on YARN-2300:
---

+1 LGTM, will commit the patch

> Document better sample requests for RM web services for submitting apps
> ---
>
> Key: YARN-2300
> URL: https://issues.apache.org/jira/browse/YARN-2300
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2300.0.patch
>
>
> The documentation for RM web services should provide better examples for app 
> submission.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2347:
-

Attachment: YARN-2347.patch

> Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in 
> yarn-server-common
> 
>
> Key: YARN-2347
> URL: https://issues.apache.org/jira/browse/YARN-2347
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-2347.patch
>
>
> We have similar things for version state for RM, NM, TS (TimelineServer), 
> etc. I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2300) Document better sample requests for RM web services for submitting apps

2014-07-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072745#comment-14072745
 ] 

Hudson commented on YARN-2300:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5957 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5957/])
YARN-2300. Improved the documentation of the sample requests for RM REST API - 
submitting an app. Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612981)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


> Document better sample requests for RM web services for submitting apps
> ---
>
> Key: YARN-2300
> URL: https://issues.apache.org/jira/browse/YARN-2300
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Fix For: 2.5.0
>
> Attachments: apache-yarn-2300.0.patch
>
>
> The documentation for RM web services should provide better examples for app 
> submission.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-23 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072751#comment-14072751
 ] 

Arpit Agarwal commented on YARN-1994:
-

+1 for the v6 patch. I will hold off on committing until Vinod or another YARN 
committer can sanity check the changes. Thanks [~cwelch] and [~mipoto]!

I think there is a Windows line-endings issue with the patch hence Jenkins 
failed to pick it up. I was able to apply it with _git apply -p0 
--whitespace=fix_



> Expose YARN/MR endpoints on multiple interfaces
> ---
>
> Key: YARN-1994
> URL: https://issues.apache.org/jira/browse/YARN-1994
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager, webapp
>Affects Versions: 2.4.0
>Reporter: Arpit Agarwal
>Assignee: Craig Welch
> Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, 
> YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch
>
>
> YARN and MapReduce daemons currently do not support specifying a wildcard 
> address for the server endpoints. This prevents the endpoints from being 
> accessible from all interfaces on a multihomed machine.
> Note that if we do specify INADDR_ANY for any of the options, it will break 
> clients as they will attempt to connect to 0.0.0.0. We need a solution that 
> allows specifying a hostname or IP-address for clients while requesting 
> wildcard bind for the servers.
> (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time

2014-07-23 Thread Leitao Guo (JIRA)
Leitao Guo created YARN-2348:


 Summary: ResourceManager web UI should display locale time instead 
of UTC time
 Key: YARN-2348
 URL: https://issues.apache.org/jira/browse/YARN-2348
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: Leitao Guo
 Attachments: 1.before-change.jpg, 2.after-change.jpg

The ResourceManager web UI, including the application list and scheduler, displays 
UTC time by default; this will confuse users who do not use UTC time. The web UI 
should display the user's local time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time

2014-07-23 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated YARN-2348:
-

Attachment: 2.after-change.jpg

> ResourceManager web UI should display locale time instead of UTC time
> -
>
> Key: YARN-2348
> URL: https://issues.apache.org/jira/browse/YARN-2348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
> Attachments: 1.before-change.jpg, 2.after-change.jpg
>
>
> The ResourceManager web UI, including the application list and scheduler, 
> displays UTC time by default; this will confuse users who do not use UTC time. 
> The web UI should display the user's local time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time

2014-07-23 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated YARN-2348:
-

Attachment: 1.before-change.jpg

> ResourceManager web UI should display locale time instead of UTC time
> -
>
> Key: YARN-2348
> URL: https://issues.apache.org/jira/browse/YARN-2348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
> Attachments: 1.before-change.jpg, 2.after-change.jpg
>
>
> The ResourceManager web UI, including the application list and scheduler, 
> displays UTC time by default; this will confuse users who do not use UTC time. 
> The web UI should display the user's local time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time

2014-07-23 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated YARN-2348:
-

Attachment: YARN-2348.patch

Please take a look at the patch.

> ResourceManager web UI should display locale time instead of UTC time
> -
>
> Key: YARN-2348
> URL: https://issues.apache.org/jira/browse/YARN-2348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
> Attachments: 1.before-change.jpg, 2.after-change.jpg, YARN-2348.patch
>
>
> The ResourceManager web UI, including the application list and scheduler, 
> displays UTC time by default; this will confuse users who do not use UTC time. 
> The web UI should display the user's local time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072770#comment-14072770
 ] 

Hadoop QA commented on YARN-2347:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657521/YARN-2347.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4409//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4409//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4409//console

This message is automatically generated.

> Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in 
> yarn-server-common
> 
>
> Key: YARN-2347
> URL: https://issues.apache.org/jira/browse/YARN-2347
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-2347.patch
>
>
> We have similar things for version state for RM, NM, TS (TimelineServer), 
> etc. I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

