[jira] [Updated] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1680:
--
Attachment: YARN-1680-WIP.patch

Work-in-progress patch. I will add a unit test.

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each, so total cluster capacity is 32GB. Cluster 
 slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
 became unstable (3 map tasks got killed), so the MRAppMaster blacklisted it. 
 All reducer tasks are now running in the cluster.
 The MRAppMaster does not preempt the reducers because the headroom used in the reducer-preemption 
 calculation still counts the blacklisted node's memory. This makes 
 jobs hang forever: the ResourceManager does not assign any new containers on 
 blacklisted nodes, yet the availableResources it returns still reflects the cluster's free 
 memory. 
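
A minimal sketch of the direction the description implies (this is not the attached WIP patch; the class and method names here are hypothetical): subtract the capacities of the nodes this application has blacklisted from the cluster-wide headroom before it is reported in the AM heartbeat.

{code}
// Hypothetical helper, not the attached patch: adjust the headroom reported
// to the AM by removing the capacity of nodes the application blacklisted.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class BlacklistAwareHeadroom {
  static Resource headroomExcludingBlacklisted(Resource clusterHeadroom,
      Iterable<Resource> blacklistedNodeCapacities) {
    Resource adjusted = Resources.clone(clusterHeadroom);
    for (Resource node : blacklistedNodeCapacities) {
      Resources.subtractFrom(adjusted, node);
    }
    // Clamp at zero so we never report negative headroom.
    if (adjusted.getMemory() < 0) {
      adjusted.setMemory(0);
    }
    if (adjusted.getVirtualCores() < 0) {
      adjusted.setVirtualCores(0);
    }
    return adjusted;
  }
}
{code}

With headroom computed this way, the MR AM's reducer-preemption check would only see resources on nodes it can actually be allocated on.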



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2014-10-01 Thread Abin Shahab (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abin Shahab updated YARN-1964:
--
Attachment: YARN-1964.patch

Harmonized changes between yarn-default.xml and YarnConfiguration. Updated docs.

 Create Docker analog of the LinuxContainerExecutor in YARN
 --

 Key: YARN-1964
 URL: https://issues.apache.org/jira/browse/YARN-1964
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Abin Shahab
 Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch


 Docker (https://www.docker.io/) is an increasingly popular container technology.
 In the context of YARN, support for Docker provides an elegant way for 
 applications to *package* their software into a Docker 
 container (an entire Linux file system, including custom versions of Perl, Python, 
 etc.) and use it as a blueprint to launch all of their YARN containers with the 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154522#comment-14154522
 ] 

Hadoop QA commented on YARN-1964:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672275/YARN-1964.patch
  against trunk revision 17d1202.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5194//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5194//console

This message is automatically generated.

 Create Docker analog of the LinuxContainerExecutor in YARN
 --

 Key: YARN-1964
 URL: https://issues.apache.org/jira/browse/YARN-1964
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Abin Shahab
 Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch


 Docker (https://www.docker.io/) is an increasingly popular container technology.
 In the context of YARN, support for Docker provides an elegant way for 
 applications to *package* their software into a Docker 
 container (an entire Linux file system, including custom versions of Perl, Python, 
 etc.) and use it as a blueprint to launch all of their YARN containers with the 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154687#comment-14154687
 ] 

Hudson commented on YARN-2387:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2387. Resource Manager crashes with NPE due to lack of synchronization. 
Contributed by Mit Desai (jlowe: rev feaf139b4f327d33011e5a4424c06fb44c630955)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerStatusPBImpl.java


 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue, we found that ContainerStatusPBImpl has 
 methods that are called from different threads but are not synchronized. 
 The 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.
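
A self-contained sketch of the race and the fix (the names are illustrative; this is not the real ContainerStatusPBImpl): the PBImpl record keeps a lazily created builder that several methods mutate, and toString() can be reached from a different thread than the setters, so unsynchronized access can observe half-initialized builder state and throw an NPE.

{code}
// Illustrative stand-in for the PBImpl pattern (not the real class): the
// builder is shared mutable state, so every method touching it is declared
// synchronized to close the race between toString() and the setters.
public class RecordPBImplSketch {
  private StringBuilder builder;  // stand-in for the generated protobuf builder
  private String proto;           // stand-in for the built protobuf message

  public synchronized void setDiagnostics(String diagnostics) {
    maybeInitBuilder();
    builder.append(diagnostics);
  }

  public synchronized String getProto() {
    mergeLocalToProto();
    return proto;
  }

  private synchronized void mergeLocalToProto() {
    maybeInitBuilder();
    proto = builder.toString();   // "build" the message from the builder
    builder = null;               // cleared until the next mutation
  }

  private synchronized void maybeInitBuilder() {
    if (builder == null) {
      builder = new StringBuilder(proto == null ? "" : proto);
    }
  }

  @Override
  public synchronized String toString() {
    return getProto();            // reached from scheduler logging, as in the stack trace above
  }
}
{code}

Without the synchronized keyword, one thread can null out or re-create the builder while another thread is still using it, which is the kind of NPE the stack trace above shows.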



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154684#comment-14154684
 ] 

Hudson commented on YARN-2610:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2610. Hamlet should close table tags. (Ray Chiang via kasha) (kasha: rev 
f7743dd07dfbe0dde9be71acfaba16ded52adba7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/hamlet/Hamlet.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestInfoBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/hamlet/TestHamlet.java


 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, and tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually with browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154679#comment-14154679
 ] 

Hudson commented on YARN-1492:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/bin/yarn


 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
Priority: Critical
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf, shared_cache_design_v6.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, copying of jobjars and libjars sometimes becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 mention defeating the purpose of bringing compute to where the data is. This 
 is wasteful because in most cases the code doesn't change much across jobs.
 I'd like to propose and discuss the feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154690#comment-14154690
 ] 

Hudson commented on YARN-2594:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2594. Potential deadlock in RM when querying 
ApplicationResourceUsageReport. (Wangda Tan via kasha) (kasha: rev 
14d60dadc25b044a2887bf912ba5872367f2dffb)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java


 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive; its log contains only the following 
 type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}
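
As a generic illustration of how this class of bug presents (this is not the actual RM code path fixed in RMAppImpl, just a minimal lock-ordering example): two threads that take the same two locks in opposite order block each other, and every event queued behind the stuck dispatcher thread simply accumulates, matching the ever-growing event-queue sizes in the log above.

{code}
// Minimal lock-order-inversion demo (illustrative only, not the RM's code):
// with the sleeps in place the two threads reliably deadlock.
public class LockOrderDeadlockDemo {
  private static final Object appLock = new Object();
  private static final Object schedulerLock = new Object();

  public static void main(String[] args) {
    Thread reportThread = new Thread(() -> {
      synchronized (appLock) {            // e.g. assembling an application report
        sleep(100);
        synchronized (schedulerLock) {    // ...then asking the scheduler for usage
          System.out.println("report done");
        }
      }
    });
    Thread updateThread = new Thread(() -> {
      synchronized (schedulerLock) {      // e.g. handling a node update
        sleep(100);
        synchronized (appLock) {          // ...then updating the application
          System.out.println("update done");
        }
      }
    });
    reportThread.start();
    updateThread.start();
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }
}
{code}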



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154686#comment-14154686
 ] 

Hudson commented on YARN-2602:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2602. Fixed possible NPE in ApplicationHistoryManagerOnTimelineStore. 
Contributed by Zhijie Shen (jianhe: rev 
bbff96be48119774688981d04baf444639135977)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java


 Generic History Service of TimelineServer sometimes not able to handle NPE
 --

 Key: YARN-2602
 URL: https://issues.apache.org/jira/browse/YARN-2602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
 Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running
Reporter: Karam Singh
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: YARN-2602.1.patch


 ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running.
 When I ran the WS API for AHS/GHS:
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001'
 {code}
 it ran successfully.
 However,
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/ws/v1/applicationhistory/apps'
 {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
 {code}
 failed with an internal server error (500).
 After looking at the TimelineServer logs, I found that there was an NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154691#comment-14154691
 ] 

Hudson commented on YARN-2627:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #697 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/697/])
YARN-2627. Added the info logs of attemptFailuresValidityInterval and number of 
previous failed attempts. Contributed by Xuan Gong. (zjshen: rev 
9582a50176800433ad3fa8829a50c28b859812a3)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt


 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.
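
For context, a small sketch of the client-side setting the new logs describe (assuming an ApplicationSubmissionContext already obtained via YarnClient; the surrounding submission code is omitted): with a validity interval set, only AM failures inside that window count against the attempt limit, and the RM-side logs added here make the configured value visible.

{code}
// Sketch only: set the attempt-failures validity interval on the submission
// context before submitting the application.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class ValidityIntervalExample {
  static void configure(ApplicationSubmissionContext appContext) {
    appContext.setMaxAppAttempts(4);
    // Only AM failures within the last 10 minutes count toward the limit.
    appContext.setAttemptFailuresValidityInterval(10 * 60 * 1000L);
  }
}
{code}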



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154839#comment-14154839
 ] 

Hudson commented on YARN-2610:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1888 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1888/])
YARN-2610. Hamlet should close table tags. (Ray Chiang via kasha) (kasha: rev 
f7743dd07dfbe0dde9be71acfaba16ded52adba7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/hamlet/Hamlet.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/hamlet/TestHamlet.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestInfoBlock.java
* hadoop-yarn-project/CHANGES.txt


 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, and tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually with browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154833#comment-14154833
 ] 

Hudson commented on YARN-1492:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1888 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1888/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
Priority: Critical
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf, shared_cache_design_v6.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, copying of jobjars and libjars sometimes becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 mention defeating the purpose of bringing compute to where the data is. This 
 is wasteful because in most cases the code doesn't change much across jobs.
 I'd like to propose and discuss the feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154834#comment-14154834
 ] 

Hudson commented on YARN-2179:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1888 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1888/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* hadoop-yarn-project/CHANGES.txt


 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Fix For: 2.7.0

 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).
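
A hypothetical shape for the role described above (the interface and method names here are illustrative, not the committed API): the SCM asks the RM which applications are still active so that an in-memory store can decide, at startup, which cache entries are still in use.

{code}
// Illustrative interface only (not the committed AppChecker API).
import java.io.IOException;
import java.util.Collection;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;

interface AppCheckerSketch {
  /** Whether the given application is still running in the cluster. */
  boolean isApplicationActive(ApplicationId appId) throws YarnException, IOException;

  /** All currently active applications, gathered once at SCM startup. */
  Collection<ApplicationId> getActiveApplications() throws YarnException, IOException;
}
{code}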



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154846#comment-14154846
 ] 

Hudson commented on YARN-2627:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1888 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1888/])
YARN-2627. Added the info logs of attemptFailuresValidityInterval and number of 
previous failed attempts. Contributed by Xuan Gong. (zjshen: rev 
9582a50176800433ad3fa8829a50c28b859812a3)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java


 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154842#comment-14154842
 ] 

Hudson commented on YARN-2387:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1888 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1888/])
YARN-2387. Resource Manager crashes with NPE due to lack of synchronization. 
Contributed by Mit Desai (jlowe: rev feaf139b4f327d33011e5a4424c06fb44c630955)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerStatusPBImpl.java


 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue, we found that ContainerStatusPBImpl has 
 methods that are called from different threads but are not synchronized. 
 The 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-10-01 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154851#comment-14154851
 ] 

Jason Lowe commented on YARN-2179:
--

The pom versions are incorrect in branch-2 from the cherry-pick.  The pom says 
3.0.0-SNAPSHOT, but it needs to be 2.6.0-SNAPSHOT in branch-2.

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Fix For: 2.7.0

 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2633) TestContainerLauncherImpl sometimes fails

2014-10-01 Thread Mit Desai (JIRA)
Mit Desai created YARN-2633:
---

 Summary: TestContainerLauncherImpl sometimes fails
 Key: YARN-2633
 URL: https://issues.apache.org/jira/browse/YARN-2633
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Mit Desai
Assignee: Mit Desai


{noformat}
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NoSuchMethodException: 
org.apache.hadoop.yarn.api.ContainerManagementProtocol$$EnhancerByMockitoWithCGLIB$$25708415.close()
at java.lang.Class.getMethod(Class.java:1665)
at 
org.apache.hadoop.yarn.factories.impl.pb.RpcClientFactoryPBImpl.stopClient(RpcClientFactoryPBImpl.java:90)
at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.stopProxy(HadoopYarnProtoRPC.java:54)
at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.mayBeCloseProxy(ContainerManagementProtocolProxy.java:79)
at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:225)
at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.shutdownAllContainers(ContainerLauncherImpl.java:320)
at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.serviceStop(ContainerLauncherImpl.java:331)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.mapreduce.v2.app.launcher.TestContainerLauncherImpl.testMyShutdown(TestContainerLauncherImpl.java:315)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154872#comment-14154872
 ] 

Karthik Kambatla commented on YARN-2179:


Thanks for catching it, Jason. Just pushed another commit fixing the pom 
version in sharedcachemanager. 

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Fix For: 2.7.0

 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154914#comment-14154914
 ] 

Hudson commented on YARN-2627:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2627. Added the info logs of attemptFailuresValidityInterval and number of 
previous failed attempts. Contributed by Xuan Gong. (zjshen: rev 
9582a50176800433ad3fa8829a50c28b859812a3)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt


 Add logs when attemptFailuresValidityInterval is enabled
 

 Key: YARN-2627
 URL: https://issues.apache.org/jira/browse/YARN-2627
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2627.1.patch, YARN-2627.2.patch


 After YARN-611, users can specify attemptFailuresValidityInterval for their 
 applications. This is for testing/debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154903#comment-14154903
 ] 

Hudson commented on YARN-2179:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt


 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Fix For: 2.7.0

 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154910#comment-14154910
 ] 

Hudson commented on YARN-2387:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2387. Resource Manager crashes with NPE due to lack of synchronization. 
Contributed by Mit Desai (jlowe: rev feaf139b4f327d33011e5a4424c06fb44c630955)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerStatusPBImpl.java


 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue, we found that ContainerStatusPBImpl has 
 methods that are called from different threads but are not synchronized. 
 The 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154902#comment-14154902
 ] 

Hudson commented on YARN-1492:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris 
Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java
* hadoop-yarn-project/hadoop-yarn/bin/yarn
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
Priority: Critical
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf, shared_cache_design_v6.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, copying of jobjars and libjars sometimes becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 mention defeating the purpose of bringing compute to where the data is. This 
 is wasteful because in most cases the code doesn't change much across jobs.
 I'd like to propose and discuss the feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154907#comment-14154907
 ] 

Hudson commented on YARN-2610:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2610. Hamlet should close table tags. (Ray Chiang via kasha) (kasha: rev 
f7743dd07dfbe0dde9be71acfaba16ded52adba7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/hamlet/TestHamlet.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestInfoBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/hamlet/Hamlet.java


 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Fix For: 2.6.0

 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, and tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually with browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154909#comment-14154909
 ] 

Hudson commented on YARN-2602:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2602. Fixed possible NPE in ApplicationHistoryManagerOnTimelineStore. 
Contributed by Zhijie Shen (jianhe: rev 
bbff96be48119774688981d04baf444639135977)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java


 Generic History Service of TimelineServer sometimes not able to handle NPE
 --

 Key: YARN-2602
 URL: https://issues.apache.org/jira/browse/YARN-2602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
 Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running
Reporter: Karam Singh
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: YARN-2602.1.patch


 ATS is running with AHS/GHS enabled to use TimelineStore.
 Running for 4-5 days, with many random example jobs running.
 When I ran the WS API for AHS/GHS:
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001'
 {code}
 it ran successfully.
 However,
 {code}
 curl --negotiate -u : 
 'http://TIMELINE_SERVER_WEBAPP_ADDR/ws/v1/applicationhistory/apps'
 {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
 {code}
 failed with an internal server error (500).
 After looking at the TimelineServer logs, I found that there was an NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154913#comment-14154913
 ] 

Hudson commented on YARN-2594:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1913 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1913/])
YARN-2594. Potential deadlock in RM when querying 
ApplicationResourceUsageReport. (Wangda Tan via kasha) (kasha: rev 
14d60dadc25b044a2887bf912ba5872367f2dffb)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt


 Potential deadlock in RM when querying ApplicationResourceUsageReport
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch


 The ResourceManager sometimes becomes unresponsive; its log contains only the following 
 type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
 (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-10-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154931#comment-14154931
 ] 

Junping Du commented on YARN-2613:
--

+1. Patch looks good to me. Will commit it shortly.

 NMClient doesn't have retries for supporting rolling-upgrades
 -

 Key: YARN-2613
 URL: https://issues.apache.org/jira/browse/YARN-2613
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch


 While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This 
 JIRA is to add an NMProxy (similar to RMProxy) with a retry implementation to 
 support rolling upgrades.
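
A minimal retry sketch under assumptions (this is not the NMProxy implementation in the attached patches, which builds on Hadoop's retry-policy machinery): the caller simply re-tries while the NM is unreachable, up to a bounded number of attempts.

{code}
// Generic retry-until-up loop, illustrative only.
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class RetryUntilUp {
  static <T> T callWithRetries(Callable<T> call, int maxRetries, long sleepMs)
      throws Exception {
    ConnectException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.call();
      } catch (ConnectException e) {
        last = e;                 // NM likely still restarting; wait and retry
        Thread.sleep(sleepMs);
      }
    }
    throw new IOException("NM still unreachable after " + maxRetries + " retries", last);
  }
}
{code}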



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2180) In-memory backing store for cache manager

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154934#comment-14154934
 ] 

Hadoop QA commented on YARN-2180:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672206/YARN-2180-trunk-v6.patch
  against trunk revision 17d1202.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5195//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5195//console

This message is automatically generated.

 In-memory backing store for cache manager
 -

 Key: YARN-2180
 URL: https://issues.apache.org/jira/browse/YARN-2180
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
 YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, 
 YARN-2180-trunk-v6.patch


 Implement an in-memory backing store for the cache manager.
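A minimal sketch of what such an in-memory store could look like: a concurrent map 
from resource checksum to the cached file name plus the applications referencing it. 
Class and method names below are illustrative assumptions, not the YARN-2180 API.

{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only -- names are made up, not the YARN-2180 API. */
public class InMemoryStoreSketch {

  private static final class Entry {
    final String fileName;
    final Set<String> appRefs = ConcurrentHashMap.newKeySet();
    Entry(String fileName) { this.fileName = fileName; }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();

  /** Record a cached resource; no-op if the checksum is already known. */
  public void addResource(String checksum, String fileName) {
    entries.putIfAbsent(checksum, new Entry(fileName));
  }

  /** Register an application as a user of a cached resource. */
  public boolean addResourceReference(String checksum, String appId) {
    Entry e = entries.get(checksum);
    if (e == null) {
      return false;          // resource not in the cache
    }
    e.appRefs.add(appId);
    return true;
  }

  /** True if nothing references the resource, i.e. it can be cleaned up. */
  public boolean isEvictable(String checksum) {
    Entry e = entries.get(checksum);
    return e == null || e.appRefs.isEmpty();
  }
}
{code}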



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2634) Test failure for TestClientRMTokens

2014-10-01 Thread Junping Du (JIRA)
Junping Du created YARN-2634:


 Summary: Test failure for TestClientRMTokens
 Key: YARN-2634
 URL: https://issues.apache.org/jira/browse/YARN-2634
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Junping Du


The test fails as below:
{noformat}
---
Test set: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
---
Tests run: 6, Failures: 3, Errors: 2, Skipped: 0, Time elapsed: 60.184 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
  Time elapsed: 22.693 sec <<< FAILURE!
java.lang.AssertionError: expected:<getProxy> but was:<null>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:272)

testShortCircuitRenewCancelDifferentHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
  Time elapsed: 20.087 sec <<< FAILURE!
java.lang.AssertionError: expected:<getProxy> but was:<null>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostDifferentPort(TestClientRMTokens.java:283)

testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
  Time elapsed: 0.031 sec <<< ERROR!
java.lang.NullPointerException: null
at 
org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:148)
at 
org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:101)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:309)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:241)

testShortCircuitRenewCancelSameHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
  Time elapsed: 0.061 sec <<< FAILURE!
java.lang.AssertionError: expected:<getProxy> but was:<null>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelSameHostDifferentPort(TestClientRMTokens.java:261)

testShortCircuitRenewCancelWildcardAddress(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
  Time elapsed: 0.07 sec <<< ERROR!
java.lang.NullPointerException: null
at org.apache.hadoop.net.NetUtils.isLocalAddress(NetUtils.java:684)
at 
org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:149)

   

{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2634) Test failure for TestClientRMTokens

2014-10-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2634:
-
Target Version/s: 2.6.0

 Test failure for TestClientRMTokens
 ---

 Key: YARN-2634
 URL: https://issues.apache.org/jira/browse/YARN-2634
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Junping Du

 The test fails as below:
 {noformat}
 ---
 Test set: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
 ---
 Tests run: 6, Failures: 3, Errors: 2, Skipped: 0, Time elapsed: 60.184 sec 
 <<< FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
 testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 22.693 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:272)
 testShortCircuitRenewCancelDifferentHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 20.087 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostDifferentPort(TestClientRMTokens.java:283)
 testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.031 sec <<< ERROR!
 java.lang.NullPointerException: null
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:148)
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:101)
 at org.apache.hadoop.security.token.Token.renew(Token.java:377)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:309)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:241)
 testShortCircuitRenewCancelSameHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.061 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelSameHostDifferentPort(TestClientRMTokens.java:261)
 testShortCircuitRenewCancelWildcardAddress(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.07 sec <<< ERROR!
 java.lang.NullPointerException: null
 at org.apache.hadoop.net.NetUtils.isLocalAddress(NetUtils.java:684)
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:149)
   
   
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED

2014-10-01 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154946#comment-14154946
 ] 

Hong Zhiguo commented on YARN-2545:
---

How about the state of the appAttempt? Should it end up as FAILED instead of 
FINISHED?

 RMApp should transit to FAILED when AM calls finishApplicationMaster with 
 FAILED
 

 Key: YARN-2545
 URL: https://issues.apache.org/jira/browse/YARN-2545
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hong Zhiguo
Assignee: Hong Zhiguo
Priority: Minor

 If the AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED, 
 and then exits, the corresponding RMApp and RMAppAttempt transition to the state 
 FINISHED.
 I think this is wrong and confusing. On the RM WebUI, this application is 
 displayed as State=FINISHED, FinalStatus=FAILED, and is counted as Apps 
 Completed, not as Apps Failed.
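For reference, a minimal sketch of the AM-side call in question, assuming a typical 
AMRMClient-based AM (registration and the rest of the AM lifecycle elided); the class 
name and diagnostic message are illustrative.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class FailedUnregisterSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> am = AMRMClient.createAMRMClient();
    am.init(new Configuration());
    am.start();
    // ... registerApplicationMaster(...) and the actual AM work elided ...
    // This is the call the issue is about: the final status reported here is
    // FAILED, yet the RMApp/RMAppAttempt currently end up in the FINISHED state.
    am.unregisterApplicationMaster(
        FinalApplicationStatus.FAILED, "application failed", null);
    am.stop();
  }
}
{code}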



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields

2014-10-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2615:
-
Attachment: YARN-2615.patch

Uploaded the first patch; it includes the changes on ClientToAMTokenIdentifier (and 
its test), RMDelegationTokenIdentifier and TimelineDelegationTokenIdentifier. The 
compatibility tests for RMDelegationTokenIdentifier haven't been completed because 
TestClientRMTokens fails on trunk (without the code here); filed YARN-2634 to fix 
that before getting the test in.

 ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended 
 fields
 

 Key: YARN-2615
 URL: https://issues.apache.org/jira/browse/YARN-2615
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2615.patch


 As three TokenIdentifiers were updated in YARN-668, ClientToAMTokenIdentifier 
 and DelegationTokenIdentifier should also be updated in the same way to allow 
 fields to be extended in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2634) Test failure for TestClientRMTokens

2014-10-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2634:
-
Priority: Blocker  (was: Major)

 Test failure for TestClientRMTokens
 ---

 Key: YARN-2634
 URL: https://issues.apache.org/jira/browse/YARN-2634
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Junping Du
Priority: Blocker

 The test fails as below:
 {noformat}
 ---
 Test set: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
 ---
 Tests run: 6, Failures: 3, Errors: 2, Skipped: 0, Time elapsed: 60.184 sec 
 <<< FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
 testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 22.693 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:272)
 testShortCircuitRenewCancelDifferentHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 20.087 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostDifferentPort(TestClientRMTokens.java:283)
 testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.031 sec <<< ERROR!
 java.lang.NullPointerException: null
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:148)
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:101)
 at org.apache.hadoop.security.token.Token.renew(Token.java:377)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:309)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:241)
 testShortCircuitRenewCancelSameHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.061 sec <<< FAILURE!
 java.lang.AssertionError: expected:<getProxy> but was:<null>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:144)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelSameHostDifferentPort(TestClientRMTokens.java:261)
 testShortCircuitRenewCancelWildcardAddress(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens)
   Time elapsed: 0.07 sec <<< ERROR!
 java.lang.NullPointerException: null
 at org.apache.hadoop.net.NetUtils.isLocalAddress(NetUtils.java:684)
 at 
 org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:149)
   
   
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-10-01 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-015.patch

Patch -15; this is patch -14 rebased against trunk with a conflict fixed.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, yarnregistry.pdf, 
 yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.
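A minimal sketch of the general idea only (a service instance publishing a record to 
ZooKeeper with the plain ZK client); the path layout and JSON record below are made-up 
placeholders, not the registry format proposed in the attached documents, parent znodes 
are assumed to exist, and connection handling is omitted.

{code}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RegistryPublishSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZK quorum (address is a placeholder); waiting for the
    // connection and error handling are omitted in this sketch.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 15000, event -> { });
    // The service instance publishes where it actually came up; an ephemeral
    // node makes the record disappear automatically if the instance dies.
    String record = "{\"host\":\"node-17\",\"port\":41433}";
    zk.create("/registry/users/alice/hbase/instance-0001",
        record.getBytes(StandardCharsets.UTF_8),
        ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
    zk.close();
  }
}
{code}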



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155093#comment-14155093
 ] 

Vinod Kumar Vavilapalli commented on YARN-1063:
---

Tx for the updates [~rusanu]! I am committing this now to unblock the follow up 
patches, trusting [~ivanmi]'s reviews on the Windows side of things.

 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security token of a 
 sandboxed process launched by a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched process without 
 granting rights to other processes launched on the same machine but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT based executables. This method was ruled out due to the lack of 
 official support for standard windows APIs. At some point in the future 
 windows may support functionality similar to BSD jails or Linux containers, 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this join the account to the predefined group 
 labeled Pre-Windows 2000 Compatible Access
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so will not be 
 able to display any information or create UI.
 * The launched process will have no network credentials. Any access of 
 network resources that requires domain authentication will fail.
 h2. Implementation:
 Winutils performs the following steps:
 # Enable the required privileges for the current process.
 # Register as a trusted process with the Local Security Authority (LSA).
 # Create a new logon for the user passed on the command line.
 # Load/Create a profile on the local machine for the new logon.
 # Create a new environment for the new logon.
 # Launch the new process in a job with the task name specified and using the 
 created logon.
 # Wait for the JOB to exit.
 h2. Future work:
 The following work was scoped out of this check in:
 * Support for non-domain users or machines that are not domain joined.
 * Support for privilege isolation by running the task launcher in a high 
 privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2616) Add CLI client to the registry to list/view entries

2014-10-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155098#comment-14155098
 ] 

Steve Loughran commented on YARN-2616:
--

thanks.

I'm going to pull this down into the main YARN-913 patch and sync up with 
changes, but will then post the patch here for it to be reviewed/completed in 
isolation.

# I'll set things up for tests to go in, though I won't do the tests...I'll 
leave that as half the challenge.
# Here's my evolving [[Updated Hadoop style 
guide|https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md]]

 Add CLI client to the registry to list/view entries
 ---

 Key: YARN-2616
 URL: https://issues.apache.org/jira/browse/YARN-2616
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Steve Loughran
Assignee: Akshay Radia
 Attachments: yarn-2616-v1.patch, yarn-2616-v2.patch


 registry needs a CLI interface



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories

2014-10-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2624:
---
 Priority: Blocker  (was: Major)
 Target Version/s: 2.6.0
Affects Version/s: 2.5.1

 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.1
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Blocker

 We have found that resource localization fails on a cluster with the following 
 error in certain cases.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to download rsrc { { 
 hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
  1412027745352, FILE, null 
 },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /data/yarn/nm/filecache/27
   at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
   at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
   at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2632) Document NM Restart feature

2014-10-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2632:
---
Priority: Blocker  (was: Major)

Marking this a blocker to ensure we don't miss it in 2.6. 

 Document NM Restart feature
 ---

 Key: YARN-2632
 URL: https://issues.apache.org/jira/browse/YARN-2632
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Junping Du
Priority: Blocker

 As a new feature in YARN, we should document this feature's behavior, 
 configuration, and things to pay attention to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1972:
--
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-732

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, 
 YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * change the DCE created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changes the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'
 * runs the localization as standalone process instead of an in-process Java 
 method call. This in turn relies on the winutil createAsUser feature to run 
 the localization as the job user.
  
 When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The approach on the WCE came from a practical trial-and-error approach. I had 
 to iron out some issues around the Windows script shell limitations (command 
 line length) to get it to work, the biggest issue being the huge CLASSPATH 
 that is commonplace in Hadoop environment container executions. The job 
 container itself is already dealing with this via a so called 'classpath 
 jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch 
 as a separate container the same issue had to be resolved and I used the same 
 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set the 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set the `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group name that the nodemanager service principal is a 
 member of (equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
 For the WCE to work the nodemanager must run as a service principal that is a 
 member of the local Administrators group or LocalSystem. This is derived from 
 the need to invoke the LoadUserProfile API, which mentions these requirements in 
 the specifications. This is in addition to the SE_TCB privilege mentioned in 
 YARN-1063, but this requirement will automatically imply that the SE_TCB 
 privilege is held by the nodemanager. For the Linux speakers in the audience, 
 the requirement is basically to run NM as root.
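Expressed as a yarn-site.xml fragment, the two settings above would look roughly 
like this; the group name is a placeholder for whatever group the NM service 
account actually belongs to.

{code}
<!-- Sketch of the WCE settings described above; "nodemanager-group" is a placeholder. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <value>nodemanager-group</value>
</property>
{code}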
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE we had discussed the need to 
 isolate the high privilege operations into a separate process, an 'executor' 
 service that is solely responsible to start the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use this service to launch the 
 containers. I still believe we'll end up deploying such a service, but the 
 effort to onboard such a new platform-specific service onto the project is 
 not trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-732) YARN support for container isolation on Windows

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-732:
-
Fix Version/s: (was: trunk-win)

 YARN support for container isolation on Windows
 ---

 Key: YARN-732
 URL: https://issues.apache.org/jira/browse/YARN-732
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: trunk-win
Reporter: Kyle Leckie
  Labels: security
 Attachments: winutils.diff


 There is no ContainerExecutor on windows that can launch containers in a 
 manner that creates:
 1) container isolation
 2) container execution with reduced rights
 I am working on patches that will add the ability to launch containers in a 
 process with a reduced access token. 
 Update: After examining several approaches I have settled on launching the 
 task as a domain user. I have attached the current winutils diff which is a 
 work in progress. 
 Work remaining:
 - Create isolated desktop for task processes.
 - Set integrity of spawned processed to low.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2129) Add scheduling priority to the WindowsSecureContainerExecutor

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2129:
--
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-732

 Add scheduling priority to the WindowsSecureContainerExecutor
 -

 Key: YARN-2129
 URL: https://issues.apache.org/jira/browse/YARN-2129
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-2129.1.patch, YARN-2129.2.patch


 The WCE (YARN-1972) could and should honor 
 NM_CONTAINER_EXECUTOR_SCHED_PRIORITY.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155116#comment-14155116
 ] 

Hudson commented on YARN-1063:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6164 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6164/])
YARN-1063. Augmented Hadoop common winutils to have the ability to create 
containers as domain users. Contributed by Remus Rusanu. (vinodkv: rev 
5ca97f1e60b8a7848f6eadd15f6c08ed390a8cda)
* hadoop-yarn-project/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c
* hadoop-common-project/hadoop-common/src/main/winutils/symlink.c
* hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h
* hadoop-common-project/hadoop-common/src/main/winutils/task.c
* 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestWinUtils.java
* hadoop-common-project/hadoop-common/src/main/winutils/chown.c


 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Fix For: 2.6.0

 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security token of a 
 sandboxed process launched by a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched process without 
 granting rights to other processes launched on the same machine but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT based executables. This method was ruled out due to the lack of 
 official support for standard windows APIs. At some point in the future 
 windows may support functionality similar to BSD jails or Linux containers, 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this join the account to the predefined group 
 labeled Pre-Windows 2000 Compatible Access
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so will not be 
 able to display any information or create UI.
 * The launched process will have no network credentials. Any access of 
 network resources that requires domain authentication will fail.
 h2. Implementation:
 Winutils performs the following steps:
 # Enable the required privileges for the current process.
 # Register as a trusted process with the Local Security Authority (LSA).
 # Create a new logon for the user passed on the command line.
 # Load/Create a profile on the local machine for the new logon.
 # Create a new 

[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155127#comment-14155127
 ] 

Vinod Kumar Vavilapalli commented on YARN-1972:
---

bq. Remus Rusanu Vinod Kumar Vavilapalli, as on YARN-1063, we can go ahead and 
address these comments as part of the YARN-2198 effort, it's not necessary to 
resolve these before these patches are committed.
+1 for tracking the remaining issues at YARN-1063.

This looks good, checking this in.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, 
 YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * change the DCE created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changes the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'
 * runs the localization as standalone process instead of an in-process Java 
 method call. This in turn relies on the winutil createAsUser feature to run 
 the localization as the job user.
  
 When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The approach on the WCE came from a practical trial-and-error approach. I had 
 to iron out some issues around the Windows script shell limitations (command 
 line length) to get it to work, the biggest issue being the huge CLASSPATH 
 that is commonplace in Hadoop environment container executions. The job 
 container itself is already dealing with this via a so called 'classpath 
 jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch 
 as a separate container the same issue had to be resolved and I used the same 
 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set the 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set the `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group name that the nodemanager service principal is a 
 member of (equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
 For the WCE to work the nodemanager must run as a service principal that is a 
 member of the local Administrators group or LocalSystem. This is derived from 
 the need to invoke the LoadUserProfile API, which mentions these requirements in 
 the specifications. This is in addition to the SE_TCB privilege mentioned in 
 YARN-1063, but this requirement will automatically imply that the SE_TCB 
 privilege is held by the nodemanager. For the Linux speakers in the audience, 
 the requirement is basically to run NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE we had discussed the need to 
 isolate the high privilege operations into a separate process, an 'executor' 
 service that is solely responsible to start the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use this service to launch the 
 containers. I still believe we'll end up deploying such a service, but the 
 effort to onboard such a new platform-specific service onto the project is 
 not trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155136#comment-14155136
 ] 

Zhijie Shen commented on YARN-2630:
---

Makes sense. +1

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container. But the 
 DistributedShell logic is not expecting this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155139#comment-14155139
 ] 

Hadoop QA commented on YARN-2615:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672344/YARN-2615.patch
  against trunk revision 3f25d91.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.crypto.random.TestOsSecureRandom
  org.apache.hadoop.ha.TestZKFailoverControllerStress
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5196//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5196//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5196//console

This message is automatically generated.

 ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended 
 fields
 

 Key: YARN-2615
 URL: https://issues.apache.org/jira/browse/YARN-2615
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2615.patch


 As three TokenIdentifiers were updated in YARN-668, ClientToAMTokenIdentifier 
 and DelegationTokenIdentifier should also be updated in the same way to allow 
 fields to be extended in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2635) TestRMRestart fails with FairScheduler

2014-10-01 Thread Wei Yan (JIRA)
Wei Yan created YARN-2635:
-

 Summary: TestRMRestart fails with FairScheduler
 Key: YARN-2635
 URL: https://issues.apache.org/jira/browse/YARN-2635
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wei Yan
Assignee: Wei Yan


If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, 
TestRMRestart fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155151#comment-14155151
 ] 

Hudson commented on YARN-1972:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6165 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6165/])
YARN-1972. Added a secure container-executor for Windows. Contributed by Remus 
Rusanu. (vinodkv: rev ba7f31c2ee8d23ecb183f88920ef06053c0b9769)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/index.apt.vm
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java


 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, 
 YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * change the DCE created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changes the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'
 * runs the localization as standalone process instead of an in-process Java 
 method call. This in turn relies on the winutil createAsUser feature to run 
 the localization as the job user.
  
 When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The approach on the WCE came from a practical trial-and-error approach. I had 
 to iron out some issues around the Windows script shell limitations (command 
 line length) to get it to work, the biggest issue being the huge CLASSPATH 
 that is commonplace in Hadoop environment container executions. The job 
 container itself is already dealing with this via a so called 'classpath 
 jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch 
 as a separate container the same issue had to be resolved and I used the same 
 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set the 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set the 

[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155169#comment-14155169
 ] 

Hadoop QA commented on YARN-1063:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657587/YARN-1063.6.patch
  against trunk revision 04b0843.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5197//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5197//console

This message is automatically generated.

 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Fix For: 2.6.0

 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security token of a 
 sandboxed process launched by a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched process without 
 granting rights to other processes launched on the same machine but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT based executables. This method was ruled out due to the lack of 
 official support for standard windows APIs. At some point in the future 
 windows may support functionality similar to BSD jails or Linux containers, 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this join the account to the predefined group 
 labeled Pre-Windows 2000 Compatible Access
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so 

[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-1879:

Attachment: YARN-1879.16.patch

[~ozawa] I have updated your patch to compile with the latest trunk. [~jianhe], 
can you please take a look?

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, 
 YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, 
 YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2630:
--
Attachment: YARN-2630.3.patch

Uploaded a patch which renames 
NodeHeartbeatResponse#getFinishedContainersPulledByAM to 
getContainersToBeRemovedFromNM; I think that if we later add one more channel 
(not just "pulled by AM") for removing containers from the NM, the latter name 
is more semantically correct.
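
For illustration, a minimal sketch of how the renamed accessors could look on the response interface; only the old and new method names come from this discussion, the surrounding interface shown here is an assumption:

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;

public interface NodeHeartbeatResponseSketch {
  // Renamed accessor (was getFinishedContainersPulledByAM): containers the NM
  // should stop tracking, whichever channel requested their removal.
  List<ContainerId> getContainersToBeRemovedFromNM();

  // Renamed mutator (was addFinishedContainersPulledByAM).
  void addContainersToBeRemovedFromNM(List<ContainerId> containers);
}
{code}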

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-10-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155220#comment-14155220
 ] 

Jian He commented on YARN-2617:
---

looks good, +1

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.patch


 We ([~chenchun]) are testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce task, PI. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it: 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec (3 hours) 
 by default, before it is scheduled to delete the application logs and send 
 the event (a rough sketch of this delay follows below).
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.
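
Purely as an illustration of the first bullet, here is a rough sketch (not the actual NonAggregatingLogHandler code) of why the event can lag by hours, assuming the default retention of 3 * 60 * 60 seconds:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedLogDeletionSketch {
  public static void main(String[] args) {
    // Default of yarn.nodemanager.log.retain-seconds: 3 * 60 * 60 = 10800 s (3 hours).
    long retainSeconds = 3 * 60 * 60;
    ScheduledExecutorService sched = Executors.newSingleThreadScheduledExecutor();
    sched.schedule(new Runnable() {
      @Override
      public void run() {
        // Only at this point would the logs be deleted and
        // APPLICATION_LOG_HANDLING_FINISHED be dispatched, letting the NM drop
        // the app from app.context.getApplications().
        System.out.println("logs deleted; APPLICATION_LOG_HANDLING_FINISHED sent");
      }
    }, retainSeconds, TimeUnit.SECONDS);
    // In the real handler this executor lives for the NM's lifetime.
  }
}
{code}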



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155227#comment-14155227
 ] 

Zhijie Shen commented on YARN-2630:
---

Would you please check that finishedContainersPulledByAM is completely 
replaced in the code base?
{code}
-if (this.finishedContainersPulledByAM != null) {
+if (this.containersToBeRemovedFromNM != null) {
   addFinishedContainersPulledByAMToProto();
 }
{code}
{code}
-  public void addFinishedContainersPulledByAM(
+  public void addContainersToBeRemovedFromNM(
    final List<ContainerId> finishedContainersPulledByAM) {
 if (finishedContainersPulledByAM == null)
   return;
 initFinishedContainersPulledByAM();
-this.finishedContainersPulledByAM.addAll(finishedContainersPulledByAM);
+this.containersToBeRemovedFromNM.addAll(finishedContainersPulledByAM);
{code}
{code}
-  nhResponse.addFinishedContainersPulledByAM(finishedContainersPulledByAM);
+  nhResponse.addContainersToBeRemovedFromNM(finishedContainersPulledByAM);
{code}
{code}
-  response.addFinishedContainersPulledByAM(
+  response.addContainersToBeRemovedFromNM(
    new ArrayList<ContainerId>(this.finishedContainersPulledByAM));
{code}

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155235#comment-14155235
 ] 

Jian Fang commented on YARN-1680:
-

I may be wrong because I don't understand the logic fully. It seems your patch 
calculates the blacklisted resource for each application. Please clarify for me 
whether the blacklisted node is a cluster-level concept or an application-level 
one. What if multiple applications have different sets of blacklisted nodes? If 
the blacklisted node is at the cluster level, the blacklisted resource should 
seemingly be calculated at the cluster level; that is to say, you need to get 
the blacklisted nodes from other applications as well. If it is only at the 
application level, I wonder how the blacklist-task-tracker command works in 
Hadoop 1. 

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor

2014-10-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1972:
--
Attachment: YARN-1972.delta.5-branch-2.patch

The patch doesn't apply on branch-2. Generated it myself, attaching now.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, 
 YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user, as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The approach to the WCE came from practical trial and error. I had to iron 
 out some issues around the Windows script shell limitations (command-line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar'; see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer launched as a separate container, the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
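
For readers unfamiliar with the trick, here is a hedged sketch of the general 'classpath jar' idea using plain JDK APIs only; the real Hadoop helper referenced by HADOOP-8899 handles wildcards, environment expansion and more:

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarSketch {
  /**
   * Writes an "empty" jar whose manifest Class-Path attribute carries the real
   * classpath, so the launch command only has to reference this one jar and
   * stays well under the Windows command-line length limit.
   */
  public static void writeClasspathJar(String jarPath, String classPathEntries)
      throws IOException {
    Manifest manifest = new Manifest();
    Attributes attrs = manifest.getMainAttributes();
    attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Class-Path entries are space-separated; real code must URL-encode paths.
    attrs.put(Attributes.Name.CLASS_PATH, classPathEntries);
    JarOutputStream out = new JarOutputStream(new FileOutputStream(jarPath), manifest);
    out.close(); // no entries needed; the manifest alone carries the classpath
  }
}
{code}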
 h2. Deployment Requirements
 To use the WCE one needs to set 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group name that the nodemanager service principal is a 
 member of (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
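
A minimal sketch of the equivalent programmatic configuration, assuming only the two property names above; the group name "hadoopusers" is a placeholder, not something prescribed by the patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WceConfigSketch {
  public static Configuration wceConf() {
    Configuration conf = new YarnConfiguration();
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
    // Placeholder group; must be a Windows group the NM service principal belongs to.
    conf.set("yarn.nodemanager.windows-secure-container-executor.group", "hadoopusers");
    return conf;
  }
}
{code}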
 For the WCE to work, the nodemanager must run as a service principal that is 
 a member of the local Administrators group, or as LocalSystem. This is 
 derived from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition to the SE_TCB 
 privilege mentioned in YARN-1063, but this requirement automatically implies 
 that the SE_TCB privilege is held by the nodemanager. For the Linux speakers 
 in the audience, the requirement is basically to run the NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE, we had discussed the need to 
 isolate the high-privilege operations into a separate process: an 'executor' 
 service that is solely responsible for starting the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use it to launch the containers. I 
 still believe we'll end up deploying such a service, but the effort to 
 onboard such a new platform-specific service on the project is not trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155259#comment-14155259
 ] 

Hadoop QA commented on YARN-1972:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672375/YARN-1972.delta.5-branch-2.patch
  against trunk revision 1f5b42a.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5200//console

This message is automatically generated.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
 YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, 
 YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch


 h1. Windows Secure Container Executor (WCE)
 YARN-1063 adds the necessary infrastructure to launch a process as a domain 
 user, as a solution for the problem of having a security boundary between 
 processes executed in YARN containers and the Hadoop services. The WCE is a 
 container executor that leverages the winutils capabilities introduced in 
 YARN-1063 and launches containers as an OS process running as the job 
 submitter user. A description of the S4U infrastructure used by YARN-1063 and 
 of the alternatives considered can be read on that JIRA.
 The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
 drive the flow of execution, but it overrides some methods to the effect of:
 * changing the DCE-created user cache directories to be owned by the job user 
 and by the nodemanager group.
 * changing the actual container run command to use the 'createAsUser' command 
 of winutils task instead of 'create'
 * running the localization as a standalone process instead of an in-process 
 Java method call. This in turn relies on the winutils createAsUser feature to 
 run the localization as the job user.
  
 When compared to the LinuxContainerExecutor (LCE), the WCE has some minor 
 differences:
 * it does not delegate the creation of the user cache directories to the 
 native implementation.
 * it does not require special handling to be able to delete user files
 The approach to the WCE came from practical trial and error. I had to iron 
 out some issues around the Windows script shell limitations (command-line 
 length) to get it to work, the biggest issue being the huge CLASSPATH that is 
 commonplace in Hadoop container executions. The job container itself already 
 deals with this via a so-called 'classpath jar'; see HADOOP-8899 and YARN-316 
 for details. For the WCE localizer launched as a separate container, the same 
 issue had to be resolved, and I used the same 'classpath jar' approach.
 h2. Deployment Requirements
 To use the WCE one needs to set 
 `yarn.nodemanager.container-executor.class` to 
 `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
 and set `yarn.nodemanager.windows-secure-container-executor.group` to a 
 Windows security group name that the nodemanager service principal is a 
 member of (the equivalent of the LCE 
 `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
 does not require any configuration outside of Hadoop's own yarn-site.xml.
 For the WCE to work, the nodemanager must run as a service principal that is 
 a member of the local Administrators group, or as LocalSystem. This is 
 derived from the need to invoke the LoadUserProfile API, which mentions these 
 requirements in its specification. This is in addition to the SE_TCB 
 privilege mentioned in YARN-1063, but this requirement automatically implies 
 that the SE_TCB privilege is held by the nodemanager. For the Linux speakers 
 in the audience, the requirement is basically to run the NM as root.
 h2. Dedicated high privilege Service
 Due to the high privilege required by the WCE, we had discussed the need to 
 isolate the high-privilege operations into a separate process: an 'executor' 
 service that is solely responsible for starting the containers (including the 
 localizer). The NM would have to authenticate, authorize and communicate with 
 this service via an IPC mechanism and use it to launch the containers. I 
 still believe we'll end up deploying such a service, but the effort to 
 onboard such a new platform-specific service on the project is not trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155282#comment-14155282
 ] 

Jian Fang commented on YARN-1680:
-

Also, it seems the variable blackListedResources in SchedulerApplicationAttempt 
is not initialized in YARN-1680-WIP.patch, which causes an NPE. 
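
A minimal sketch of the kind of initialization that would avoid the NPE; the field name follows the WIP patch's naming, but the snippet itself is an assumption, not the actual fix:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class BlacklistedResourceTrackerSketch {
  // In SchedulerApplicationAttempt this would be a field; start at <0 MB, 0 vcores>
  // so the very first add cannot hit a null reference.
  private Resource blackListedResources = Resources.createResource(0, 0);

  public void addBlacklistedNode(Resource nodeTotal) {
    Resources.addTo(blackListedResources, nodeTotal);
  }

  public Resource getBlackListedResources() {
    return blackListedResources;
  }
}
{code}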

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-10-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2628:

Attachment: apache-yarn-2628.0.patch

Uploaded a patch with fix and test case.

 Capacity scheduler with DominantResourceCalculator carries out reservation 
 even though slots are free
 -

 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2628.0.patch


 We've noticed that if you run the CapacityScheduler with the 
 DominantResourceCalculator, sometimes apps will end up with containers in a 
 reserved state even though free slots are available.
 The root cause seems to be this piece of code from CapacityScheduler.java -
 {noformat}
 // Try to schedule more if there are no reservations to fulfill
 if (node.getReservedContainer() == null) {
   if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
       node.getAvailableResource(), minimumAllocation)) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Trying to schedule on node: " + node.getNodeName() +
           ", available: " + node.getAvailableResource());
     }
     root.assignContainers(clusterResource, node, false);
   }
 } else {
   LOG.info("Skipping scheduling since node " + node.getNodeID() +
       " is reserved by application " +
       node.getReservedContainer().getContainerId().getApplicationAttemptId());
 }
 {noformat}
 The code is meant to check if a node has any slots available for containers. 
 Since it uses the greaterThanOrEqual function, we end up in a situation where 
 greaterThanOrEqual returns true even though we may not have enough CPU or 
 memory to actually run the container.
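
To make the failure mode concrete, here is a hedged sketch of a componentwise check, using Resources.fitsIn as the illustrative alternative to the dominant-share comparison; the actual patch may take a different route:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NodeFitSketch {
  /** True only if every dimension of minimumAllocation fits into available. */
  static boolean hasRoomFor(Resource available, Resource minimumAllocation) {
    return Resources.fitsIn(minimumAllocation, available);
  }

  public static void main(String[] args) {
    // 8 GB but only 1 vcore left on the node; the minimum allocation wants 2 vcores.
    // A dominant-share comparison can still report ">=", a componentwise check cannot.
    Resource available = Resources.createResource(8 * 1024, 1);
    Resource min = Resources.createResource(1024, 2);
    System.out.println(hasRoomFor(available, min)); // false: not enough vcores
  }
}
{code}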



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155295#comment-14155295
 ] 

Craig Welch commented on YARN-1680:
---

As I recall, blacklisted nodes are application level

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155302#comment-14155302
 ] 

Hadoop QA commented on YARN-1879:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672365/YARN-1879.16.patch
  against trunk revision 737f280.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
  
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5198//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5198//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5198//console

This message is automatically generated.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, 
 YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, 
 YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155318#comment-14155318
 ] 

Hadoop QA commented on YARN-2630:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672368/YARN-2630.3.patch
  against trunk revision 1f5b42a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5199//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5199//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5199//console

This message is automatically generated.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, 
 YARN-2630.4.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155344#comment-14155344
 ] 

Jian Fang commented on YARN-1680:
-

Is there any behavior change from Hadoop 1 to Hadoop 2 for blacklisted nodes? 
HADOOP-5643 seems to have discussed the ability to blacklist a tasktracker. 

We have a use case to blacklist a node at the cluster level before 
decommissioning it, so as to remove the node gracefully. If the blacklist is 
only at the application level, then we have to figure out something else.

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155352#comment-14155352
 ] 

Hadoop QA commented on YARN-2630:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672374/YARN-2630.4.patch
  against trunk revision 1f5b42a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5201//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5201//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5201//console

This message is automatically generated.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, 
 YARN-2630.4.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container, but the 
 DistributedShell logic is not expecting this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2636) Windows Secure Container Executor: add unit tests for WSCE

2014-10-01 Thread Remus Rusanu (JIRA)
Remus Rusanu created YARN-2636:
--

 Summary: Windows Secure Container Executor: add unit tests for WSCE
 Key: YARN-2636
 URL: https://issues.apache.org/jira/browse/YARN-2636
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Remus Rusanu
Assignee: Remus Rusanu
Priority: Critical


As title says.

The WSCE has no checked-in unit tests. Much of the functionality depends on the 
elevated hadoopwinutilsvc service and cannot be tested, but let's test what can 
be mocked in Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155360#comment-14155360
 ] 

Craig Welch commented on YARN-1680:
---

There are different kinds of blacklisting; the one at issue in this JIRA is the 
application-level one. With cluster-level blacklisting the node's resource 
value is removed from the cluster resource, so it doesn't need to be addressed 
here (removing it from the cluster resource already removes its resource amount 
from any headroom calculation). This change addresses the application-level 
blacklist, which needs to be handled at this level.
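
For illustration only, a minimal sketch of the kind of per-application headroom adjustment this implies; the method and parameter names are assumptions, not the actual YARN-1680 patch:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  /**
   * Application-level headroom: whatever the queue would report, minus the
   * resources of nodes this particular application has blacklisted.
   * Cluster-level blacklisting needs no such step because those nodes are
   * already excluded from the cluster resource.
   */
  static Resource applicationHeadroom(Resource queueHeadroom,
                                      Resource appBlacklistedResources) {
    return Resources.subtractFrom(Resources.clone(queueHeadroom),
        appBlacklistedResources);
  }
}
{code}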

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2408) Resource Request REST API for YARN

2014-10-01 Thread Renan DelValle (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renan DelValle updated YARN-2408:
-
Attachment: YARN-2408.4.patch

Clustered resource requests that have the same priority, same number of 
containers, same relax locality, and same number of cores.
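
As a hedged illustration of the grouping rule described in the comment above, a sketch of a possible grouping key; the class and field names are assumptions, only the four criteria come from the comment:

{code:java}
import java.util.Objects;

public final class RequestGroupKeySketch {
  // Requests collapse into one entry when all four criteria match.
  final int priority;
  final int numContainers;
  final boolean relaxLocality;
  final int vCores;

  RequestGroupKeySketch(int priority, int numContainers, boolean relaxLocality, int vCores) {
    this.priority = priority;
    this.numContainers = numContainers;
    this.relaxLocality = relaxLocality;
    this.vCores = vCores;
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof RequestGroupKeySketch)) return false;
    RequestGroupKeySketch k = (RequestGroupKeySketch) o;
    return priority == k.priority && numContainers == k.numContainers
        && relaxLocality == k.relaxLocality && vCores == k.vCores;
  }

  @Override public int hashCode() {
    return Objects.hash(priority, numContainers, relaxLocality, vCores);
  }
}
{code}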

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408.4.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested to gain more insightful information into cluster 
 usage than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API (a JSON counterpart is also available):
 {code:xml}
 <resourceRequests>
   <MB>7680</MB>
   <VCores>7</VCores>
   <appMaster>
     <applicationId>application_1412191664217_0001</applicationId>
     <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
     <queueName>default</queueName>
     <totalMB>6144</totalMB>
     <totalVCores>6</totalVCores>
     <numResourceRequests>3</numResourceRequests>
     <requests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <numContainers>6</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
         <resourceNames>
           <resourceName>localMachine</resourceName>
           <resourceName>/default-rack</resourceName>
           <resourceName>*</resourceName>
         </resourceNames>
       </request>
     </requests>
   </appMaster>
   <appMaster>
   ...
   </appMaster>
 </resourceRequests>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2408) Resource Request REST API for YARN

2014-10-01 Thread Renan DelValle (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renan DelValle updated YARN-2408:
-
Attachment: (was: YARN-2408-3.patch)

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408.4.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested to gain more insightful information into cluster 
 usage than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API (a JSON counterpart is also available):
 {code:xml}
 <resourceRequests>
   <MB>7680</MB>
   <VCores>7</VCores>
   <appMaster>
     <applicationId>application_1412191664217_0001</applicationId>
     <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
     <queueName>default</queueName>
     <totalMB>6144</totalMB>
     <totalVCores>6</totalVCores>
     <numResourceRequests>3</numResourceRequests>
     <requests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <numContainers>6</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
         <resourceNames>
           <resourceName>localMachine</resourceName>
           <resourceName>/default-rack</resourceName>
           <resourceName>*</resourceName>
         </resourceNames>
       </request>
     </requests>
   </appMaster>
   <appMaster>
   ...
   </appMaster>
 </resourceRequests>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2408) Resource Request REST API for YARN

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155372#comment-14155372
 ] 

Hadoop QA commented on YARN-2408:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672388/YARN-2408.4.patch
  against trunk revision 1f5b42a.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5203//console

This message is automatically generated.

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408.4.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested to gain more insightful information into cluster 
 usage than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API (a JSON counterpart is also available):
 {code:xml}
 <resourceRequests>
   <MB>7680</MB>
   <VCores>7</VCores>
   <appMaster>
     <applicationId>application_1412191664217_0001</applicationId>
     <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
     <queueName>default</queueName>
     <totalMB>6144</totalMB>
     <totalVCores>6</totalVCores>
     <numResourceRequests>3</numResourceRequests>
     <requests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <numContainers>6</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
         <resourceNames>
           <resourceName>localMachine</resourceName>
           <resourceName>/default-rack</resourceName>
           <resourceName>*</resourceName>
         </resourceNames>
       </request>
     </requests>
   </appMaster>
   <appMaster>
   ...
   </appMaster>
 </resourceRequests>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-10-01 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155381#comment-14155381
 ] 

Jian Fang commented on YARN-1680:
-

Thanks Craig for your clarification. Is the cluster-level blacklisted node 
called an unhealthy node? I checked the Hadoop 2 code, but only found the 
cluster-level blacklist related to parameters such as 
yarn.nodemanager.health-checker.script.path. Are there any other code paths for 
the cluster-level blacklist in Hadoop 2?

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
 YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. 
 Cluster slow start is set to 1.
 A job is running whose reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now 
 running in the cluster.
 MRAppMaster does not preempt the reducers because, for the reducer preemption 
 calculation, headroom still counts the blacklisted node's memory. This makes 
 jobs hang forever (ResourceManager does not assign any new containers on 
 blacklisted nodes but returns an availableResources value that still counts 
 cluster free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155393#comment-14155393
 ] 

Hadoop QA commented on YARN-2628:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672381/apache-yarn-2628.0.patch
  against trunk revision 1f5b42a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5202//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5202//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5202//console

This message is automatically generated.

 Capacity scheduler with DominantResourceCalculator carries out reservation 
 even though slots are free
 -

 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2628.0.patch


 We've noticed that if you run the CapacityScheduler with the 
 DominantResourceCalculator, sometimes apps will end up with containers in a 
 reserved state even though free slots are available.
 The root cause seems to be this piece of code from CapacityScheduler.java -
 {noformat}
 // Try to schedule more if there are no reservations to fulfill
 if (node.getReservedContainer() == null) {
   if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
       node.getAvailableResource(), minimumAllocation)) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Trying to schedule on node: " + node.getNodeName() +
           ", available: " + node.getAvailableResource());
     }
     root.assignContainers(clusterResource, node, false);
   }
 } else {
   LOG.info("Skipping scheduling since node " + node.getNodeID() +
       " is reserved by application " +
       node.getReservedContainer().getContainerId().getApplicationAttemptId());
 }
 {noformat}
 The code is meant to check if a node has any slots available for containers. 
 Since it uses the greaterThanOrEqual function, we end up in a situation where 
 greaterThanOrEqual returns true even though we may not have enough CPU or 
 memory to actually run the container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-10-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2617:
--
Attachment: YARN-2617.5.patch

just added one more log statement myself, pending jenkins

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.5.patch, YARN-2617.patch


 We ([~chenchun]) are testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce task, PI. The NM continuously 
 reported completed containers whose application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to 
 clean up already completed applications. But it will only remove the appId 
 from 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM 
 might not receive this event for a long time, or might never receive it: 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec (3 hours) 
 by default, before it is scheduled to delete the application logs and send 
 the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have 
 HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-10-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155398#comment-14155398
 ] 

Varun Vasudev commented on YARN-2628:
-

The release audit error is from an HDFS file and is unrelated.

 Capacity scheduler with DominantResourceCalculator carries out reservation 
 even though slots are free
 -

 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2628.0.patch


 We've noticed that if you run the CapacityScheduler with the 
 DominantResourceCalculator, sometimes apps will end up with containers in a 
 reserved state even though free slots are available.
 The root cause seems to be this piece of code from CapacityScheduler.java -
 {noformat}
 // Try to schedule more if there are no reservations to fulfill
 if (node.getReservedContainer() == null) {
   if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
       node.getAvailableResource(), minimumAllocation)) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Trying to schedule on node: " + node.getNodeName() +
           ", available: " + node.getAvailableResource());
     }
     root.assignContainers(clusterResource, node, false);
   }
 } else {
   LOG.info("Skipping scheduling since node " + node.getNodeID() +
       " is reserved by application " +
       node.getReservedContainer().getContainerId().getApplicationAttemptId());
 }
 {noformat}
 The code is meant to check if a node has any slots available for containers. 
 Since it uses the greaterThanOrEqual function, we end up in a situation where 
 greaterThanOrEqual returns true even though we may not have enough CPU or 
 memory to actually run the container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155400#comment-14155400
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Regarding the release audit warning, it's also unrelated.

{quote}
 !? 
/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/hadoop-hdfs-project/hadoop-hdfs/.gitattributes
Lines that start with ? in the release audit report indicate files that do 
not have an Apache license header
{quote}

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, 
 YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, 
 YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.17.patch

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.2-3.patch

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number of the container id, without the epoch) after 
 YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.
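
For illustration, a standalone sketch of the deprecation pattern being proposed; the real ContainerId is a YARN abstract class, this snippet only mirrors the two accessors named above:

{code:java}
public abstract class ContainerIdSketch {
  /** @deprecated returns only the sequence number, without the epoch. */
  @Deprecated
  public abstract int getId();

  /** Full container id; after YARN-2229 this encodes the epoch as well. */
  public abstract long getContainerId();
}
{code}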



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155467#comment-14155467
 ] 

Karthik Kambatla commented on YARN-2254:


Patch looks mostly good. One nit: Can we rename ALLOC_FILE to FS_ALLOC_FILE and 
test-queues.xml to test-fs-queues.xml to clarify the files are used only 
for FairScheduler? 

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-01 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155473#comment-14155473
 ] 

Tsuyoshi OZAWA commented on YARN-2312:
--

I cannot reproduce the findbugs warning. Let me check the reason on Jenkins.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number of the container id, without the epoch) after 
 YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155477#comment-14155477
 ] 

Hadoop QA commented on YARN-2617:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672391/YARN-2617.5.patch
  against trunk revision 1f5b42a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager
  org.apache.hadoop.yarn.server.nodemanager.TestEventFlow
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5205//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5205//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5205//console

This message is automatically generated.

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.5.patch, YARN-2617.patch


 We ([~chenchun]) are testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce job (Pi). The NM continuously 
 reported completed containers whose application had already finished, even after 
 the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. But it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might 
 not receive this event for a long time, or might never receive it (see the 
 sketch after the two cases below). 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are scheduled for deletion and the event is sent.
 * For LogAggregationService, the upload might fail (e.g. if the user does not 
 have HDFS write permission), and then it will not send the event.
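 For illustration only (this is not the attached patch; the container-status 
 source and the isApplicationRunning() helper are hypothetical names), the 
 filtering idea amounts to something like:
 {code}
 // Sketch: skip completed containers whose application is no longer running,
 // so they are not re-sent to the RM on every heartbeat.
 // allContainerStatuses and isApplicationRunning(appId) are assumed helpers
 // over the NM context, not the real NodeStatusUpdater API.
 List<ContainerStatus> toReport = new ArrayList<ContainerStatus>();
 for (ContainerStatus status : allContainerStatuses) {
   ApplicationId appId =
       status.getContainerId().getApplicationAttemptId().getApplicationId();
   if (status.getState() == ContainerState.COMPLETE
       && !isApplicationRunning(appId)) {
     continue;
   }
   toReport.add(status);
 }
 {code}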



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155481#comment-14155481
 ] 

Hadoop QA commented on YARN-1879:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672394/YARN-1879.17.patch
  against trunk revision 875aa79.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5206//console

This message is automatically generated.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-10-01 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-016.patch

patch -016: includes registry cli patch (-002) of YARN-2616

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, 
 YARN-913-016.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2616) Add CLI client to the registry to list/view entries

2014-10-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155530#comment-14155530
 ] 

Steve Loughran commented on YARN-2616:
--

the patch I just posted doesn't {{stop()}} the registry service, so it will leak a 
curator instance and its threads.
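
A follow-up would be to stop the service explicitly on exit. As a hedged 
illustration only (the {{registryOperations}} variable stands for whatever 
Service-derived client the CLI actually creates; this is not the patch's code):
{code}
// Sketch: ensure the registry client Service is stopped when the CLI exits,
// so the Curator instance and its threads are released.
// ServiceOperations is org.apache.hadoop.service.ServiceOperations.
try {
  // ... perform the list/view action against the registry ...
} finally {
  ServiceOperations.stopQuietly(registryOperations);
}
{code}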

 Add CLI client to the registry to list/view entries
 ---

 Key: YARN-2616
 URL: https://issues.apache.org/jira/browse/YARN-2616
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Steve Loughran
Assignee: Akshay Radia
 Attachments: yarn-2616-v1.patch, yarn-2616-v2.patch


 registry needs a CLI interface



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-10-01 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2254:

Attachment: YARN-2254.004.patch

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-10-01 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1418#comment-1418
 ] 

zhihai xu commented on YARN-2254:
-

Hi [~kasha], good suggestion. I uploaded a new patch, YARN-2254.004.patch, to 
address the comments. 
Thanks

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.18.patch

Rebased on trunk.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.2-wip.patch, 
 YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155565#comment-14155565
 ] 

Hadoop QA commented on YARN-2630:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672374/YARN-2630.4.patch
  against trunk revision 1f5b42a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5204//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5204//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5204//console

This message is automatically generated.

 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, 
 YARN-2630.4.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container. But the 
 DistributedShell logic does not expect this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-10-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155571#comment-14155571
 ] 

Karthik Kambatla commented on YARN-2254:


+1, pending Jenkins. 

I'll commit this later today. 

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-10-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155584#comment-14155584
 ] 

Jian He commented on YARN-2628:
---

Looks good; one minor comment on the test case:
- the following assertion depends on timing: since the allocation happens 
asynchronously, it might fail. Could you use a loop to check whether the container 
has been allocated, and time out otherwise? (A possible shape is sketched after the snippet.)
{code}
Thread.sleep(1000);
allocResponse = am1.schedule();
Assert.assertEquals(1, allocResponse.getAllocatedContainers().size());
{code}
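
As a hedged sketch only (the 10-second budget and the 100 ms poll interval are 
assumptions, not part of the existing test), the polling check could look like:
{code}
// Poll for the allocation instead of relying on a fixed sleep; fail on timeout.
// Assumes the surrounding test method declares "throws Exception".
List<Container> allocated = new ArrayList<Container>();
long deadline = System.currentTimeMillis() + 10000; // assumed 10s budget
while (allocated.isEmpty() && System.currentTimeMillis() < deadline) {
  Thread.sleep(100); // assumed poll interval
  allocResponse = am1.schedule();
  allocated.addAll(allocResponse.getAllocatedContainers());
}
Assert.assertEquals(1, allocated.size());
{code}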

 Capacity scheduler with DominantResourceCalculator carries out reservation 
 even though slots are free
 -

 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2628.0.patch


 We've noticed that if you run the CapacityScheduler with the 
 DominantResourceCalculator, sometimes apps will end up with containers in a 
 reserved state even though free slots are available.
 The root cause seems to be this piece of code from CapacityScheduler.java -
 {noformat}
 // Try to schedule more if there are no reservations to fulfill
 if (node.getReservedContainer() == null) {
   if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
       node.getAvailableResource(), minimumAllocation)) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Trying to schedule on node: " + node.getNodeName() +
           ", available: " + node.getAvailableResource());
     }
     root.assignContainers(clusterResource, node, false);
   }
 } else {
   LOG.info("Skipping scheduling since node " + node.getNodeID() +
       " is reserved by application " +
       node.getReservedContainer().getContainerId().getApplicationAttemptId());
 }
 {noformat}
 The code is meant to check if a node has any slots available for containers. 
 Since it uses the greaterThanOrEqual function, we end up in a situation where 
 greaterThanOrEqual returns true even though we may not have enough CPU or 
 memory to actually run the container.
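 As a hedged illustration of the point (not the attached patch), a per-dimension 
 check would only attempt scheduling when the node has at least the minimum 
 allocation of both memory and vcores available:
 {code}
 // Sketch: require every dimension of the minimum allocation to fit, instead of
 // comparing dominant shares via greaterThanOrEqual.
 Resource available = node.getAvailableResource();
 boolean fitsAllDimensions =
     available.getMemory() >= minimumAllocation.getMemory()
     && available.getVirtualCores() >= minimumAllocation.getVirtualCores();
 if (node.getReservedContainer() == null && fitsAllDimensions) {
   root.assignContainers(clusterResource, node, false);
 }
 {code}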



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2616) Add CLI client to the registry to list/view entries

2014-10-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155624#comment-14155624
 ] 

Steve Loughran commented on YARN-2616:
--

Features of the 003 patch:
# registry instance created via factory
# uses the configuration instance built up on the command line (though it also 
creates a {{YarnConfiguration()}} around that)
# pulls out all exception-to-error-text mapping into a single method
# covers the current set of errors
# also logs at debug level if enabled.


 Add CLI client to the registry to list/view entries
 ---

 Key: YARN-2616
 URL: https://issues.apache.org/jira/browse/YARN-2616
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Reporter: Steve Loughran
Assignee: Akshay Radia
 Attachments: YARN-2616-003.patch, yarn-2616-v1.patch, 
 yarn-2616-v2.patch


 registry needs a CLI interface



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs

2014-10-01 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155638#comment-14155638
 ] 

Joep Rottinghuis commented on YARN-1414:


@sandyr could we get some love on this jira? We're essentially running with a 
forked FairScheduler and would like to reduce the tech debt each time we uprev to a 
newer version.

 with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
 -

 Key: YARN-1414
 URL: https://issues.apache.org/jira/browse/YARN-1414
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, scheduler
Affects Versions: 2.0.5-alpha
Reporter: Siqi Li
Assignee: Siqi Li
 Fix For: 2.2.0

 Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS

2014-10-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155643#comment-14155643
 ] 

Zhijie Shen commented on YARN-2583:
---

Per discussion offline:

1. In AggregatedLogDeletionService of the JHS, we delete the log files of completed 
apps, and in AppLogAggregatorImpl of the NM, we delete the log files of the running 
LRS. We need to add a test case to verify AggregatedLogDeletionService won't 
delete the running LRS logs. 

2. We apply the same retention policy on both sides, using the time to 
determine which log files need to be deleted.

3. For scalability considerations, let's keep the criterion of the number of logs 
per app, in case the rolling interval is small and too many log files 
are generated. But let's keep the config private to AppLogAggregatorImpl (a rough 
deletion sketch follows).
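
For illustration only (the retention parameters, directory layout, and names 
like {{fs}}, {{appLogDir}}, {{retentionMillis}} and {{maxFilesPerApp}} are 
assumptions, not taken from the attached patch), a combined time-plus-count 
check over already-uploaded log files might look like:
{code}
// Sketch: keep at most maxFilesPerApp of the newest rolled log files for a
// running LRS app, and delete anything older than the retention cut-off.
FileStatus[] files = fs.listStatus(appLogDir); // org.apache.hadoop.fs
Arrays.sort(files, new Comparator<FileStatus>() {
  @Override
  public int compare(FileStatus a, FileStatus b) {
    // newest first
    return Long.compare(b.getModificationTime(), a.getModificationTime());
  }
});
long cutoff = System.currentTimeMillis() - retentionMillis;
for (int i = 0; i < files.length; i++) {
  if (i >= maxFilesPerApp || files[i].getModificationTime() < cutoff) {
    fs.delete(files[i].getPath(), false);
  }
}
{code}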

 Modify the LogDeletionService to support Log aggregation for LRS
 

 Key: YARN-2583
 URL: https://issues.apache.org/jira/browse/YARN-2583
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2583.1.patch


 Currently, AggregatedLogDeletionService deletes old logs from HDFS. It 
 checks the cut-off time, and if all logs for an application are older than 
 the cut-off time, the app-log-dir in HDFS is deleted. This will not 
 work for LRS: we expect an LRS application to keep running for a long time. 
 Two different scenarios: 
 1) If we configured rollingIntervalSeconds, new log files will 
 always be uploaded to HDFS. The number of log files for the application will 
 grow larger and larger, and no log files will ever be deleted.
 2) If we did not configure rollingIntervalSeconds, the log file can only 
 be uploaded to HDFS after the application has finished. It is very possible 
 that the logs are uploaded after the cut-off time, which causes problems 
 because by then the app-log-dir for the application in HDFS has already been 
 deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2637) maximum-am-resource-percent will be violated when resource of AM is > minimumAllocation

2014-10-01 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-2637:


 Summary: maximum-am-resource-percent will be violated when 
resource of AM is > minimumAllocation
 Key: YARN-2637
 URL: https://issues.apache.org/jira/browse/YARN-2637
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Wangda Tan
Priority: Critical


Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it will check whether the app can 
be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) {
  FiCaSchedulerApp application = i.next();

  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }

  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}

An example:
If a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum 
resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of 
AMs that can be launched is 200. If the user uses 5M for each AM (> minimum_allocation), 
all apps can still be activated, and they will occupy all of the queue's resources 
(200 x 5M = 1G) instead of only max_am_resource_percent of the queue.
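
One possible direction, sketched only for illustration (the real fix may differ; 
usedAMResource, amResourceLimit, resourceCalculator and application.getAMResource() 
are assumed names, not existing code), is to cap activation by the accumulated AM 
resource rather than by a count derived from minimum_allocation:
{code}
// Sketch, placed inside the activation loop above: stop activating once the
// queue's AM-resource limit would be exceeded.
// amResourceLimit = queue_max_capacity * maximum_am_resource_percent, as above.
Resource needed = Resources.add(usedAMResource, application.getAMResource());
if (!Resources.lessThanOrEqual(resourceCalculator, clusterResource,
    needed, amResourceLimit)) {
  break; // activating this AM would violate max_am_resource
}
Resources.addTo(usedAMResource, application.getAMResource());
user.activateApplication();
activeApplications.add(application);
i.remove();
{code}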



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155664#comment-14155664
 ] 

Hadoop QA commented on YARN-913:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672406/YARN-913-016.patch
  against trunk revision 875aa79.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 36 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1266 javac 
compiler warnings (more than the trunk's current 1265 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

  
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell

  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5208//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5208//artifact/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5208//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5208//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5208//console

This message is automatically generated.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
 YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
 YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
 YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, 
 YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, 
 YARN-913-016.patch, yarnregistry.pdf, yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155669#comment-14155669
 ] 

Hadoop QA commented on YARN-2254:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672416/YARN-2254.004.patch
  against trunk revision 875aa79.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5209//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5209//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5209//console

This message is automatically generated.

 change TestRMWebServicesAppsModification to support FairScheduler.
 --

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free

2014-10-01 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2628:

Attachment: apache-yarn-2628.1.patch

Uploaded a patch to address [~jianhe]'s comments.

 Capacity scheduler with DominantResourceCalculator carries out reservation 
 even though slots are free
 -

 Key: YARN-2628
 URL: https://issues.apache.org/jira/browse/YARN-2628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.5.1
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2628.0.patch, apache-yarn-2628.1.patch


 We've noticed that if you run the CapacityScheduler with the 
 DominantResourceCalculator, sometimes apps will end up with containers in a 
 reserved state even though free slots are available.
 The root cause seems to be this piece of code from CapacityScheduler.java -
 {noformat}
 // Try to schedule more if there are no reservations to fulfill
 if (node.getReservedContainer() == null) {
   if (Resources.greaterThanOrEqual(calculator, getClusterResource(),
       node.getAvailableResource(), minimumAllocation)) {
     if (LOG.isDebugEnabled()) {
       LOG.debug("Trying to schedule on node: " + node.getNodeName() +
           ", available: " + node.getAvailableResource());
     }
     root.assignContainers(clusterResource, node, false);
   }
 } else {
   LOG.info("Skipping scheduling since node " + node.getNodeID() +
       " is reserved by application " +
       node.getReservedContainer().getContainerId().getApplicationAttemptId());
 }
 {noformat}
 The code is meant to check if a node has any slots available for containers. 
 Since it uses the greaterThanOrEqual function, we end up in a situation where 
 greaterThanOrEqual returns true even though we may not have enough CPU or 
 memory to actually run the container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails

2014-10-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155702#comment-14155702
 ] 

Hudson commented on YARN-2630:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6170 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6170/])
YARN-2630. Prevented previous AM container status from being acquired by the 
current restarted AM. Contributed by Jian He. (zjshen: rev 
52bbe0f11bc8e97df78a1ab9b63f4eff65fd7a76)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatResponsePBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java


 TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
 -

 Key: YARN-2630
 URL: https://issues.apache.org/jira/browse/YARN-2630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, 
 YARN-2630.4.patch


 The problem is that after YARN-1372, in work-preserving AM restart, the 
 re-launched AM will also receive the previously failed AM container. But the 
 DistributedShell logic does not expect this extra completed container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-1715) Per queue view in RM is not implemented correctly

2014-10-01 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li resolved YARN-1715.
---
Resolution: Duplicate

 Per queue view in RM is not implemented correctly
 -

 Key: YARN-1715
 URL: https://issues.apache.org/jira/browse/YARN-1715
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li
Assignee: Siqi Li

 For now, the per-queue view in the YARN RM has not been implemented yet.
 In RmController.java it only sets the page title for the per-queue page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1391) Lost node list should be identify by NodeId

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155705#comment-14155705
 ] 

Hadoop QA commented on YARN-1391:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12618147/YARN-1391.v1.patch
  against trunk revision 52bbe0f.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5216//console

This message is automatically generated.

 Lost node list should be identify by NodeId
 ---

 Key: YARN-1391
 URL: https://issues.apache.org/jira/browse/YARN-1391
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-1391.v1.patch


 In the case of multiple NodeManagers on a single machine, each of them should be 
 identified by NodeId, which is more specific than just the host name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155707#comment-14155707
 ] 

Hadoop QA commented on YARN-1879:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672418/YARN-1879.18.patch
  against trunk revision 875aa79.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5210//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5210//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5210//console

This message is automatically generated.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.2-wip.patch, 
 YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155708#comment-14155708
 ] 

Hadoop QA commented on YARN-2312:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672397/YARN-2312.2-3.patch
  against trunk revision 875aa79.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.mapred.TestMiniMRBringup
  org.apache.hadoop.mapred.TestClusterMapReduceTestCase
  org.apache.hadoop.mapred.TestMRIntermediateDataEncryption
  org.apache.hadoop.mapred.pipes.TestPipeApplication

  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5207//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5207//artifact/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5207//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5207//console

This message is automatically generated.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch


 {{ContainerId#getId}} will only return a partial value of the containerId after 
 YARN-2229: the sequence number of the container id, without the epoch. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2254) TestRMWebServicesAppsModification should run against both CS and FS

2014-10-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2254:
---
Summary: TestRMWebServicesAppsModification should run against both CS and 
FS  (was: TestRMWebServicesAppsModification should run against both Capacity 
and FairSchedulers)

 TestRMWebServicesAppsModification should run against both CS and FS
 ---

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2254) TestRMWebServicesAppsModification should run against both Capacity and FairSchedulers

2014-10-01 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2254:
---
Summary: TestRMWebServicesAppsModification should run against both Capacity 
and FairSchedulers  (was: change TestRMWebServicesAppsModification to support 
FairScheduler.)

 TestRMWebServicesAppsModification should run against both Capacity and 
 FairSchedulers
 -

 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
  Labels: test
 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, 
 YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch


 TestRMWebServicesAppsModification skips the test if the scheduler is not 
 CapacityScheduler.
 Change TestRMWebServicesAppsModification to support both CapacityScheduler 
 and FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155715#comment-14155715
 ] 

Hadoop QA commented on YARN-2617:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672391/YARN-2617.5.patch
  against trunk revision dd1b8f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5212//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5212//console

This message is automatically generated.

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, 
 YARN-2617.5.patch, YARN-2617.patch


 We ([~chenchun]) are testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce job (Pi). The NM continuously 
 reported completed containers whose application had already finished, even after 
 the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. But it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might 
 not receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, 
 before the application logs are scheduled for deletion and the event is sent.
 * For LogAggregationService, the upload might fail (e.g. if the user does not 
 have HDFS write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected

2014-10-01 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1198:
--
Attachment: YARN-1198.9.patch

Updated version of .7 patch to current trunk (as .7 now fails to fully apply)

 Capacity Scheduler headroom calculation does not work as expected
 -

 Key: YARN-1198
 URL: https://issues.apache.org/jira/browse/YARN-1198
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
 Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
 YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, 
 YARN-1198.8.patch, YARN-1198.9.patch


 Today headroom calculation (for the app) takes place only when
 * a new node is added to or removed from the cluster
 * a new container is assigned to the application.
 However, there are potentially a lot of situations which are not considered in 
 this calculation:
 * If a container finishes, then the headroom for that application will change and 
 the AM should be notified accordingly.
 * If a single user has submitted multiple applications (app1 and app2) to the 
 same queue, then
 ** if app1's container finishes, not only app1's but also app2's AM 
 should be notified about the change in headroom;
 ** similarly, if a container is assigned to either application (app1/app2), then 
 both AMs should be notified about their headroom;
 ** to simplify the whole communication process it is ideal to keep headroom 
 per user per LeafQueue so that everyone gets the same picture (apps belonging 
 to the same user and submitted to the same queue).
 * If a new user submits an application to the queue, then all applications 
 submitted by all users in that queue should be notified of the headroom 
 change.
 * Also, today headroom is an absolute number (I think it should be normalized, 
 but then this is not going to be backward compatible..)
 * Also, when an admin user refreshes the queue, headroom has to be updated.
 These are all potential bugs in the headroom calculation (a rough sketch of the 
 base headroom formula follows).
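 For context, the base headroom idea is roughly "the user limit, capped by what is 
 still available in the queue, minus what the user has already consumed". A hedged 
 sketch only (queueMaxCapacity, queueUsedResources, userLimit, userConsumed, 
 resourceCalculator and clusterResource are illustrative names, not the exact 
 LeafQueue code):
 {code}
 // Rough sketch of the base headroom computation (illustrative names only):
 //   headroom = min(userLimit, queueMaxAvailable) - userConsumed, floored at 0.
 Resource queueMaxAvailable =
     Resources.subtract(queueMaxCapacity, queueUsedResources);
 Resource headroom = Resources.subtract(
     Resources.min(resourceCalculator, clusterResource,
         userLimit, queueMaxAvailable),
     userConsumed);
 if (Resources.lessThan(resourceCalculator, clusterResource,
     headroom, Resources.none())) {
   headroom = Resources.none();
 }
 {code}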



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

