[jira] [Resolved] (YARN-247) Fair scheduler should support assigning queues by user group

2013-12-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved YARN-247.
-

Resolution: Duplicate

 Fair scheduler should support assigning queues by user group
 

 Key: YARN-247
 URL: https://issues.apache.org/jira/browse/YARN-247
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 The MR1 fair scheduler had this capability.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1435) Distributed Shell should not run other commands except sh, and run the custom script at the same time.

2013-12-11 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845171#comment-13845171
 ] 

Zhijie Shen commented on YARN-1435:
---

+1 LGTM

 Distributed Shell should not run other commands except sh, and run the 
 custom script at the same time.
 

 Key: YARN-1435
 URL: https://issues.apache.org/jira/browse/YARN-1435
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Affects Versions: 2.3.0
Reporter: Tassapol Athiapinya
Assignee: Xuan Gong
 Attachments: YARN-1435.1.patch, YARN-1435.1.patch, YARN-1435.2.patch, 
 YARN-1435.3.patch, YARN-1435.4.patch, YARN-1435.4.patch


 Currently, if we want to run a custom script in DS, we can do it like this:
 --shell_command sh --shell_script custom_script.sh
 But it may be better to separate running shell_command and shell_script.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1494) YarnClient doesn't wrap renewDelegationToken/cancelDelegationToken of ApplicationClientProtocol

2013-12-11 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1494:
-

 Summary: YarnClient doesn't wrap 
renewDelegationToken/cancelDelegationToken of ApplicationClientProtocol
 Key: YARN-1494
 URL: https://issues.apache.org/jira/browse/YARN-1494
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen


YarnClient doesn't wrap renewDelegationToken/cancelDelegationToken of 
ApplicationClientProtocol, only getDelegationToken. After YARN-1363, 
renewDelegationToken/cancelDelegationToken are going to be async, such that the 
procedure of canceling/renewing a DT is not that straightforward. It's better 
to wrap these two APIs as well.
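
To make the proposal concrete, here is a rough sketch of the kind of wrapper 
being suggested (the class and method names are assumptions for illustration, 
not the eventual patch):

{code}
// Illustrative sketch only -- names are assumptions, not the committed API.
import java.io.IOException;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.CancelDelegationTokenRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RenewDelegationTokenRequest;
import org.apache.hadoop.yarn.api.records.Token;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.util.Records;

public class RMDelegationTokenOps {
  /** Renew an RM delegation token and return the next expiration time. */
  public static long renew(ApplicationClientProtocol rmClient, Token dtoken)
      throws YarnException, IOException {
    RenewDelegationTokenRequest req =
        Records.newRecord(RenewDelegationTokenRequest.class);
    req.setDelegationToken(dtoken);
    return rmClient.renewDelegationToken(req).getNextExpirationTime();
  }

  /** Cancel an RM delegation token. */
  public static void cancel(ApplicationClientProtocol rmClient, Token dtoken)
      throws YarnException, IOException {
    CancelDelegationTokenRequest req =
        Records.newRecord(CancelDelegationTokenRequest.class);
    req.setDelegationToken(dtoken);
    rmClient.cancelDelegationToken(req);
  }
}
{code}

Wrapping something like this in YarnClient would let applications renew/cancel 
RM tokens without talking to ApplicationClientProtocol directly, mirroring what 
is already done for getDelegationToken.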



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-312) Add updateNodeResource in ResourceManagerAdministrationProtocol

2013-12-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845226#comment-13845226
 ] 

Junping Du commented on YARN-312:
-

Hi [~vicaya], would you help review it again? Thanks!

 Add updateNodeResource in ResourceManagerAdministrationProtocol
 ---

 Key: YARN-312
 URL: https://issues.apache.org/jira/browse/YARN-312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Affects Versions: 2.2.0
Reporter: Junping Du
Assignee: Junping Du
 Attachments: YARN-312-v1.patch, YARN-312-v2.patch, YARN-312-v3.patch, 
 YARN-312-v4.1.patch, YARN-312-v4.patch, YARN-312-v5.1.patch, 
 YARN-312-v5.patch, YARN-312-v6.patch, YARN-312-v7.1.patch, 
 YARN-312-v7.1.patch, YARN-312-v7.patch, YARN-312-v8.patch


 Add a fundamental RPC (ResourceManagerAdministrationProtocol) to support 
 changing a node's resources. For design details, please refer to the parent 
 JIRA: YARN-291.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1449) Protocol changes and implementations in NM side to support change container resource

2013-12-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-1449:
-

Attachment: yarn-1449.4.patch

[~sandyr], I've updated my code according to your comments (thanks again!), but 
there are some places I haven't changed.

{quote}
{code}
+// first add all decreased containers to failed, we will support it in the
+// future
{code}
What is it that we will support in the future?
{quote}

Originally, I planned not to support container decrease, so I put all decrease 
requests directly into the failed list. This patch includes support for 
decrease requests, and such decreased containers will be added to the next node 
heartbeat sent to the RM.

{quote}
{code}
+public class ContainerChangeMonitoringEvent extends ContainersMonitorEvent {
+  private final long vmemLimit;
+  private final long pmemLimit;
{code}
No need to calculate / include the vmem here. The containers monitor should 
know about it.
{quote}

I think we need them. Originally this event is created by 
ContainerImpl.LaunchTransition, and the ContainersMonitor doesn't otherwise 
know about these limits; they are passed to it by ContainerStartMonitoringEvent. 
I just followed the same approach that ContainerStartMonitoringEvent uses.

{quote}
Unnecessary change
{code}
-  private static class ProcessTreeInfo {
+  static class ProcessTreeInfo {
{code}
{quote}

This is used for testing: the easiest way to test the ContainersMonitor is to 
change pmem/vmem by setting fields in ResourceCalculatorProcessTree. We can get 
ProcessTreeInfo from trackingContainers of ContainersMonitor, but we cannot use 
this class because it's private. I don't know if I should propose another JIRA 
for this, but I think it makes it easier to write tests for the 
ContainersMonitor.

{quote}
{code}
+  public void testResourceChange() throws Exception {
{code}
It looks like this test relies on sleeps to wait for events to be handled. This 
both makes the test run longer and can cause flaky failures. I think there is a 
way to actually wait for the events to be completed. I don't know it off the 
top of my head but can look it up for you if that would be helpful.
{quote}

I've thought about this but haven't come up with a better way to do it. This 
should fail only in some extremely bad cases (like very little resource 
available). I've set YarnConfiguration.NM_CONTAINER_MON_INTERVAL_MS to 20ms in 
the test, and 200ms seems enough for the ContainersMonitor to react to bad 
containers. Please let me know if you have any other ideas!
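
One possible alternative to fixed sleeps, sketched under the assumption that 
the hadoop-common GenericTestUtils helper is usable from this test and that a 
hypothetical isContainerResized() predicate captures the condition being 
waited on:

{code}
// Rough sketch -- ContainersMonitorStub/isContainerResized() are hypothetical
// stand-ins for whatever the test actually observes.
import java.util.concurrent.TimeoutException;
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

public class WaitForResizeExample {

  /** Hypothetical stand-in for the object the test observes. */
  public interface ContainersMonitorStub {
    boolean isContainerResized();
  }

  public static void waitForResize(final ContainersMonitorStub monitor)
      throws TimeoutException, InterruptedException {
    // Poll every 20ms and give up after 2s, instead of sleeping a fixed 200ms.
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return monitor.isContainerResized();
      }
    }, 20, 2000);
  }
}
{code}

This keeps the common case fast while still tolerating a slow machine.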

And any thoughts about this patch?

 Protocol changes and implementations in NM side to support change container 
 resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch


 As described in YARN-1197, we need to add API/implementation changes:
 1) Add a changeContainersResources method in ContainerManagementProtocol
 2) Return succeeded/failed increased/decreased containers in the response of 
 changeContainersResources
 3) Add a new decreased-containers field in NodeStatus which can help the NM 
 notify the RM of such changes
 4) Add a changeContainersResources implementation in ContainerManagerImpl
 5) Add changes in ContainersMonitorImpl to support changing the resource 
 limits of containers
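
A purely illustrative sketch of the shape such an API could take -- the record 
and method names below are assumptions for discussion, not the actual patch:

{code}
// Hypothetical shape of the proposed protocol addition; names are assumptions.
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.Resource;

public interface ContainerResourceChangeSketch {

  /** Request carrying the containers whose resources should change. */
  interface ChangeContainersResourcesRequest {
    Map<ContainerId, Resource> getContainersToIncrease();
    Map<ContainerId, Resource> getContainersToDecrease();
  }

  /** Response reporting which changes the NM accepted or rejected. */
  interface ChangeContainersResourcesResponse {
    List<ContainerId> getSucceededChangedContainers();
    List<ContainerId> getFailedChangedContainers();
  }

  /** Would live on ContainerManagementProtocol per the description above. */
  ChangeContainersResourcesResponse changeContainersResources(
      ChangeContainersResourcesRequest request);
}
{code}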



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2013-12-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845246#comment-13845246
 ] 

Steve Loughran commented on YARN-1492:
--

bq.  obviously: add a specific exception to indicate some kind of race 
condition

bq. I’m a little unsure as to which specific race you’re speaking of, or 
whether you’re talking about a generic exception that can indicate any type of 
race condition. Could you kindly clarify?

Sorry, I should have been clearer. I meant that if an exception gets thrown to 
say you detected a race condition, having it be something other than a generic 
IOException can help identify the problem.
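
Purely as an illustration, something as small as a dedicated IOException 
subclass would do (the class name below is made up):

{code}
// Hypothetical example -- ConcurrentUploadException is an invented name.
import java.io.IOException;

/** Signals that two uploaders raced to publish the same shared-cache entry. */
public class ConcurrentUploadException extends IOException {
  private static final long serialVersionUID = 1L;

  public ConcurrentUploadException(String resourceKey) {
    super("Detected concurrent upload of shared cache entry " + resourceKey);
  }
}
{code}

Callers could then catch the specific type rather than inspecting IOException 
messages.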

 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 speak of defeating the purpose of bringing compute to where data is. This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling

2013-12-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845261#comment-13845261
 ] 

Sandy Ryza commented on YARN-1404:
--

bq. Other than saying you don't want to wait for impala-under-YARN integration, 
I haven't heard any technical reservations against this approach.
I have no technical reservations with the overall approach.  In fact I'm in 
favor of it.  My points are:
* We will not see this happen for a while, and the original approach on this 
JIRA supports a workaround that has no consequences for clusters not running 
Impala on YARN.
* I'm sure many who would love to take advantage of centrally resource-managed 
HDFS caching will be unwilling to deploy HDFS through YARN.  The same goes for 
all sorts of legacy applications.  If, besides the changes Arun proposed, we 
can expose YARN's central scheduling independently of its 
deployment/enforcement, there would be a lot to gain. If this is within easy 
reach, I don't find it satisfying to argue that YARN is philosophically opposed 
to it or that the additional freedom would allow cluster-configurers to shoot 
themselves in the foot.

I realize that we are rehashing many of the same arguments so I'm not sure how 
to make progress on this. I'll wait until Tucu returns from vacation to push 
further.

 Enable external systems/frameworks to share resources with Hadoop leveraging 
 Yarn resource scheduling
 -

 Key: YARN-1404
 URL: https://issues.apache.org/jira/browse/YARN-1404
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-1404.patch


 Currently Hadoop Yarn expects to manage the lifecycle of the processes its 
 applications run workload in. External frameworks/systems could benefit from 
 sharing resources with other Yarn applications while running their workload 
 within long-running processes owned by the external framework (in other 
 words, running their workload outside of the context of a Yarn container 
 process). 
 Because Yarn provides robust and scalable resource management, it is 
 desirable for some external systems to leverage the resource governance 
 capabilities of Yarn (queues, capacities, scheduling, access control) while 
 supplying their own resource enforcement.
 Impala is an example of such a system. Impala uses Llama 
 (http://cloudera.github.io/llama/) to request resources from Yarn.
 Impala runs an impalad process on every node of the cluster. When a user 
 submits a query, the processing is broken into 'query fragments' which are 
 run in multiple impalad processes leveraging data locality (similar to 
 Map-Reduce Mappers processing a collocated HDFS block of input data).
 The execution of a 'query fragment' requires an amount of CPU and memory in 
 the impalad, as the impalad shares the host with other services (HDFS 
 DataNode, Yarn NodeManager, Hbase Region Server) and Yarn applications 
 (MapReduce tasks).
 To ensure that cluster utilization follows the Yarn scheduler policies and 
 does not overload the cluster nodes, before running a 'query fragment' on a 
 node, Impala requests the required amount of CPU and memory from Yarn. Once 
 the requested CPU and memory have been allocated, Impala starts running the 
 'query fragment', taking care that the 'query fragment' does not use more 
 resources than the ones that have been allocated. Memory is bookkept per 
 'query fragment' and the threads used for the processing of the 'query 
 fragment' are placed under a cgroup to contain CPU utilization.
 Today, for all resources that have been requested from the Yarn RM, a 
 (container) process must be started via the corresponding NodeManager. 
 Failing to do this will result in the cancellation of the container 
 allocation, relinquishing the acquired resource capacity back to the pool of 
 available resources. To avoid this, Impala starts a dummy container process 
 that does 'sleep 10y'.
 Using a dummy container process has its drawbacks:
 * the dummy container process is in a cgroup with a given number of CPU 
 shares that are not used, and Impala is re-issuing those CPU shares to 
 another cgroup for the thread running the 'query fragment'. The cgroup CPU 
 enforcement works correctly because of the CPU controller implementation 
 (but the formally specified behavior is actually undefined).
 * Impala may ask for CPU and memory independently of each other. Some 
 requests may be only memory with no CPU or vice versa. Because a container 
 requires a process, complete absence of memory or CPU is not possible even 
 if the dummy process is 'sleep'; a minimal amount of memory and CPU is 
 required for the dummy process.
 Because of this it is desirable to be able to have a container without a 
 backing process.

[jira] [Commented] (YARN-1449) Protocol changes and implementations in NM side to support change container resource

2013-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845281#comment-13845281
 ] 

Hadoop QA commented on YARN-1449:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12618208/yarn-1449.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2644//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2644//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2644//console

This message is automatically generated.

 Protocol changes and implementations in NM side to support change container 
 resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch


 As described in YARN-1197, we need to add API/implementation changes:
 1) Add a changeContainersResources method in ContainerManagementProtocol
 2) Return succeeded/failed increased/decreased containers in the response of 
 changeContainersResources
 3) Add a new decreased-containers field in NodeStatus which can help the NM 
 notify the RM of such changes
 4) Add a changeContainersResources implementation in ContainerManagerImpl
 5) Add changes in ContainersMonitorImpl to support changing the resource 
 limits of containers



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1449) Protocol changes and implementations in NM side to support change container resource

2013-12-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-1449:
-

Attachment: yarn-1449.5.patch

Updated patch to handle findbugs warning

 Protocol changes and implementations in NM side to support change container 
 resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch, 
 yarn-1449.5.patch


 As described in YARN-1197, we need to add API/implementation changes:
 1) Add a changeContainersResources method in ContainerManagementProtocol
 2) Return succeeded/failed increased/decreased containers in the response of 
 changeContainersResources
 3) Add a new decreased-containers field in NodeStatus which can help the NM 
 notify the RM of such changes
 4) Add a changeContainersResources implementation in ContainerManagerImpl
 5) Add changes in ContainersMonitorImpl to support changing the resource 
 limits of containers



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1449) Protocol changes and implementations in NM side to support change container resource

2013-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845361#comment-13845361
 ] 

Hadoop QA commented on YARN-1449:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12618222/yarn-1449.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2645//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2645//console

This message is automatically generated.

 Protocol changes and implementations in NM side to support change container 
 resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch, 
 yarn-1449.5.patch


 As described in YARN-1197, we need to add API/implementation changes:
 1) Add a changeContainersResources method in ContainerManagementProtocol
 2) Return succeeded/failed increased/decreased containers in the response of 
 changeContainersResources
 3) Add a new decreased-containers field in NodeStatus which can help the NM 
 notify the RM of such changes
 4) Add a changeContainersResources implementation in ContainerManagerImpl
 5) Add changes in ContainersMonitorImpl to support changing the resource 
 limits of containers



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1307) Rethink znode structure for RM HA

2013-12-11 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845365#comment-13845365
 ] 

Tsuyoshi OZAWA commented on YARN-1307:
--

@Jian He, could you check the latest patch? Thank you for your review.

 Rethink znode structure for RM HA
 -

 Key: YARN-1307
 URL: https://issues.apache.org/jira/browse/YARN-1307
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1307.1.patch, YARN-1307.10.patch, 
 YARN-1307.11.patch, YARN-1307.2.patch, YARN-1307.3.patch, 
 YARN-1307.4-2.patch, YARN-1307.4-3.patch, YARN-1307.4.patch, 
 YARN-1307.5.patch, YARN-1307.6.patch, YARN-1307.7.patch, YARN-1307.8.patch, 
 YARN-1307.9.patch, YARN-1307.9.patch


 Rethinking the znode structure for RM HA has been proposed in some JIRAs 
 (YARN-659, YARN-1222). The motivation of this JIRA is quoted from Bikas' 
 comment in YARN-1222:
 {quote}
 We should move to creating a node hierarchy for apps such that all znodes 
 for an app are stored under an app znode instead of the app root znode. This 
 will help in removeApplication and also in scaling better on ZK. The earlier 
 code was written this way to ensure create/delete happens under a root znode 
 for fencing. But given that we have moved to multi-operations globally, this 
 isn't required anymore.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1172) Convert *SecretManagers in the RM to services

2013-12-11 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845367#comment-13845367
 ] 

Tsuyoshi OZAWA commented on YARN-1172:
--

No problem, Vinod :-)

 Convert *SecretManagers in the RM to services
 -

 Key: YARN-1172
 URL: https://issues.apache.org/jira/browse/YARN-1172
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1172.1.patch, YARN-1172.2.patch, YARN-1172.3.patch, 
 YARN-1172.4.patch, YARN-1172.5.patch, YARN-1172.6.patch, YARN-1172.7.patch, 
 YARN-1172.8.patch






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1481) Move internal services logic from AdminService to ResourceManager

2013-12-11 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1481:
---

Summary: Move internal services logic from AdminService to ResourceManager  
(was: ResourceManager and AdminService interact in a convoluted manner after 
YARN-1318)

 Move internal services logic from AdminService to ResourceManager
 -

 Key: YARN-1481
 URL: https://issues.apache.org/jira/browse/YARN-1481
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-1481-20131207.txt, YARN-1481-20131209.txt


 This is something I found while reviewing YARN-1318, but didn't halt that 
 patch as many cycles went there already. Some top level issues
  - Not easy to follow RM's service life cycle
 -- RM adds only AdminService as its service directly.
 -- Other services are added to RM when AdminService's init calls 
 RM.activeServices.init()
  - Overall, AdminService shouldn't encompass all of RM's HA state management. 
 It was originally supposed to be the implementation of just the RPC server.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1028) Add FailoverProxyProvider like capability to RMProxy

2013-12-11 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845471#comment-13845471
 ] 

Tom White commented on YARN-1028:
-

It looks like the behaviour in this patch differs from the way failover is 
implemented for HDFS HA, where it is controlled by dfs.client.failover settings 
(e.g. dfs.client.failover.max.attempts is configured explicitly rather than 
being calculated from the IPC settings). Would having the corresponding 
settings for RM HA make sense? (E.g. from a configuration consistency and 
well-tested code path point of view.)

Why do you need both YarnFailoverProxyProvider and 
ConfiguredFailoverProxyProvider? The latter should be sufficient; it might also 
be called RMFailoverProxyProvider.
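
For illustration, the suggestion amounts to reading explicit RM-side failover 
settings analogous to the dfs.client.failover.* keys; the property names below 
are hypothetical, not existing YarnConfiguration keys:

{code}
// Hypothetical property names, mirroring the dfs.client.failover.* style.
import org.apache.hadoop.conf.Configuration;

public class RMFailoverSettingsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Explicitly configured, rather than derived from generic IPC settings.
    int maxAttempts = conf.getInt("yarn.client.failover.max.attempts", 15);
    long sleepBaseMs = conf.getLong("yarn.client.failover.sleep.base.ms", 500);
    long sleepMaxMs = conf.getLong("yarn.client.failover.sleep.max.ms", 15000);
    System.out.println("failover: attempts=" + maxAttempts
        + ", sleepBase=" + sleepBaseMs + "ms, sleepMax=" + sleepMaxMs + "ms");
  }
}
{code}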

 Add FailoverProxyProvider like capability to RMProxy
 

 Key: YARN-1028
 URL: https://issues.apache.org/jira/browse/YARN-1028
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Karthik Kambatla
 Attachments: yarn-1028-1.patch, yarn-1028-2.patch, yarn-1028-3.patch, 
 yarn-1028-4.patch, yarn-1028-5.patch, yarn-1028-draft-cumulative.patch


 RMProxy layer currently abstracts RM discovery and implements it by looking 
 up service information from configuration. Motivated by HDFS and using 
 existing classes from Common, we can add failover proxy providers that may 
 provide RM discovery in extensible ways.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1481) Move internal services logic from AdminService to ResourceManager

2013-12-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845475#comment-13845475
 ] 

Hudson commented on YARN-1481:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4864 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4864/])
YARN-1481. Move internal services logic from AdminService to ResourceManager. 
(vinodkv via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1550167)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java


 Move internal services logic from AdminService to ResourceManager
 -

 Key: YARN-1481
 URL: https://issues.apache.org/jira/browse/YARN-1481
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 2.4.0

 Attachments: YARN-1481-20131207.txt, YARN-1481-20131209.txt


 This is something I found while reviewing YARN-1318, but didn't halt that 
 patch as many cycles went there already. Some top level issues
  - Not easy to follow RM's service life cycle
 -- RM adds only AdminService as its service directly.
 -- Other services are added to RM when AdminService's init calls 
 RM.activeServices.init()
  - Overall, AdminService shouldn't encompass all of RM's HA state management. 
 It was originally supposed to be the implementation of just the RPC server.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-11 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845491#comment-13845491
 ] 

Tom White commented on YARN-1029:
-

Implementing ActiveStandbyElector sounds like a good approach, and the patch is 
a good start. From a work sequencing point of view wouldn't it be preferable to 
implement the standalone ZKFC first, since it will share a lot of the code with 
HDFS (i.e. implement the equivalent of DFSZKFailoverController)?

 Allow embedding leader election into the RM
 ---

 Key: YARN-1029
 URL: https://issues.apache.org/jira/browse/YARN-1029
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Karthik Kambatla
 Attachments: yarn-1029-approach.patch


 It should be possible to embed common ActiveStandyElector into the RM such 
 that ZooKeeper based leader election and notification is in-built. In 
 conjunction with a ZK state store, this configuration will be a simple 
 deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2013-12-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845575#comment-13845575
 ] 

Sangjin Lee commented on YARN-1492:
---

[~vinodkv] I'm fine with that. The main reason I used the HADOOP project is 
that this will result in changes in both the YARN code and the MapReduce code. 
But I think it should be OK to use the YARN project as the place for the 
umbrella JIRA.

 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 speak of defeating the purpose of bringing compute to where data is. This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1496) Protocol additions to allow moving apps between queues

2013-12-11 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-1496:
--

Assignee: (was: Jonathan Eagles)

 Protocol additions to allow moving apps between queues
 --

 Key: YARN-1496
 URL: https://issues.apache.org/jira/browse/YARN-1496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Assigned] (YARN-1496) Protocol additions to allow moving apps between queues

2013-12-11 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles reassigned YARN-1496:
-

Assignee: Jonathan Eagles  (was: Sandy Ryza)

 Protocol additions to allow moving apps between queues
 --

 Key: YARN-1496
 URL: https://issues.apache.org/jira/browse/YARN-1496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Jonathan Eagles





--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1307) Rethink znode structure for RM HA

2013-12-11 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845647#comment-13845647
 ] 

Jian He commented on YARN-1307:
---

Patch looks good to me, thanks for the update.

 Rethink znode structure for RM HA
 -

 Key: YARN-1307
 URL: https://issues.apache.org/jira/browse/YARN-1307
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1307.1.patch, YARN-1307.10.patch, 
 YARN-1307.11.patch, YARN-1307.2.patch, YARN-1307.3.patch, 
 YARN-1307.4-2.patch, YARN-1307.4-3.patch, YARN-1307.4.patch, 
 YARN-1307.5.patch, YARN-1307.6.patch, YARN-1307.7.patch, YARN-1307.8.patch, 
 YARN-1307.9.patch, YARN-1307.9.patch


 Rethinking the znode structure for RM HA has been proposed in some JIRAs 
 (YARN-659, YARN-1222). The motivation of this JIRA is quoted from Bikas' 
 comment in YARN-1222:
 {quote}
 We should move to creating a node hierarchy for apps such that all znodes 
 for an app are stored under an app znode instead of the app root znode. This 
 will help in removeApplication and also in scaling better on ZK. The earlier 
 code was written this way to ensure create/delete happens under a root znode 
 for fencing. But given that we have moved to multi-operations globally, this 
 isn't required anymore.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-679) add an entry point that can start any Yarn service

2013-12-11 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845662#comment-13845662
 ] 

Arun C Murthy commented on YARN-679:


+1 for the direction! Thanks [~ste...@apache.org]!

 add an entry point that can start any Yarn service
 --

 Key: YARN-679
 URL: https://issues.apache.org/jira/browse/YARN-679
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
 Attachments: YARN-679-001.patch


 There's no need to write separate main classes for every Yarn service, given 
 that the startup mechanism should be identical: create, init, start, wait for 
 stopped - with an interrupt handler to trigger a clean shutdown on a control-c 
 interrupt.
 Provide one that takes any classname, and a list of config files/options.
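
 A minimal sketch of such an entry point, assuming the standard 
 org.apache.hadoop.service.Service lifecycle (this is an illustration, not the 
 attached patch):

{code}
// Sketch only: a generic "run any Service" main class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.Service;

public class ServiceEntryPointSketch {
  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: ServiceEntryPointSketch <service-classname>");
      System.exit(1);
    }
    // create
    final Service service = (Service) Class.forName(args[0]).newInstance();
    // clean shutdown on control-C
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        service.stop();
      }
    });
    // init + start, then block until the service stops
    service.init(new Configuration());
    service.start();
    service.waitForServiceToStop(Long.MAX_VALUE);
  }
}
{code}

 A real version would also load the config files/options passed on the command 
 line before calling init().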



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling

2013-12-11 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845667#comment-13845667
 ] 

Arun C Murthy commented on YARN-1404:
-

bq. I have no technical reservations with the overall approach.

Since we agree on the approach and the direction we want to go, perhaps we can 
now discuss how to get there?

We don't have to implement everything in the first go; we just need to 
implement enough to solve your goals of quick integration while staying on the 
long-term path we want to get to.

Does that make sense?

 Enable external systems/frameworks to share resources with Hadoop leveraging 
 Yarn resource scheduling
 -

 Key: YARN-1404
 URL: https://issues.apache.org/jira/browse/YARN-1404
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-1404.patch


 Currently Hadoop Yarn expects to manage the lifecycle of the processes its 
 applications run workload in. External frameworks/systems could benefit from 
 sharing resources with other Yarn applications while running their workload 
 within long-running processes owned by the external framework (in other 
 words, running their workload outside of the context of a Yarn container 
 process). 
 Because Yarn provides robust and scalable resource management, it is 
 desirable for some external systems to leverage the resource governance 
 capabilities of Yarn (queues, capacities, scheduling, access control) while 
 supplying their own resource enforcement.
 Impala is an example of such a system. Impala uses Llama 
 (http://cloudera.github.io/llama/) to request resources from Yarn.
 Impala runs an impalad process on every node of the cluster. When a user 
 submits a query, the processing is broken into 'query fragments' which are 
 run in multiple impalad processes leveraging data locality (similar to 
 Map-Reduce Mappers processing a collocated HDFS block of input data).
 The execution of a 'query fragment' requires an amount of CPU and memory in 
 the impalad, as the impalad shares the host with other services (HDFS 
 DataNode, Yarn NodeManager, Hbase Region Server) and Yarn applications 
 (MapReduce tasks).
 To ensure that cluster utilization follows the Yarn scheduler policies and 
 does not overload the cluster nodes, before running a 'query fragment' on a 
 node, Impala requests the required amount of CPU and memory from Yarn. Once 
 the requested CPU and memory have been allocated, Impala starts running the 
 'query fragment', taking care that the 'query fragment' does not use more 
 resources than the ones that have been allocated. Memory is bookkept per 
 'query fragment' and the threads used for the processing of the 'query 
 fragment' are placed under a cgroup to contain CPU utilization.
 Today, for all resources that have been requested from the Yarn RM, a 
 (container) process must be started via the corresponding NodeManager. 
 Failing to do this will result in the cancellation of the container 
 allocation, relinquishing the acquired resource capacity back to the pool of 
 available resources. To avoid this, Impala starts a dummy container process 
 that does 'sleep 10y'.
 Using a dummy container process has its drawbacks:
 * the dummy container process is in a cgroup with a given number of CPU 
 shares that are not used, and Impala is re-issuing those CPU shares to 
 another cgroup for the thread running the 'query fragment'. The cgroup CPU 
 enforcement works correctly because of the CPU controller implementation 
 (but the formally specified behavior is actually undefined).
 * Impala may ask for CPU and memory independently of each other. Some 
 requests may be only memory with no CPU or vice versa. Because a container 
 requires a process, complete absence of memory or CPU is not possible even 
 if the dummy process is 'sleep'; a minimal amount of memory and CPU is 
 required for the dummy process.
 Because of this it is desirable to be able to have a container without a 
 backing process.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-408) Capacity Scheduler delay scheduling should not be disabled by default

2013-12-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845702#comment-13845702
 ] 

Hudson commented on YARN-408:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #4867 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4867/])
YARN-408. Change CapacityScheduler to not disable delay-scheduling by default. 
Contributed by Mayank Bansal. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1550245)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


 Capacity Scheduler delay scheduling should not be disabled by default
 -

 Key: YARN-408
 URL: https://issues.apache.org/jira/browse/YARN-408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Mayank Bansal
Assignee: Mayank Bansal
Priority: Minor
 Fix For: 2.4.0

 Attachments: YARN-408-trunk-2.patch, YARN-408-trunk-3.patch, 
 YARN-408-trunk.patch


 Capacity Scheduler delay scheduling should not be disabled by default.
 It should be enabled with the delay set to the number of nodes in one rack.
 Thanks,
 Mayank



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Assigned] (YARN-1496) Protocol additions to allow moving apps between queues

2013-12-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned YARN-1496:


Assignee: Sandy Ryza

 Protocol additions to allow moving apps between queues
 --

 Key: YARN-1496
 URL: https://issues.apache.org/jira/browse/YARN-1496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-1496.patch






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1496) Protocol additions to allow moving apps between queues

2013-12-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-1496:
-

Attachment: YARN-1496.patch

 Protocol additions to allow moving apps between queues
 --

 Key: YARN-1496
 URL: https://issues.apache.org/jira/browse/YARN-1496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza
 Attachments: YARN-1496.patch






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1496) Protocol additions to allow moving apps between queues

2013-12-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845706#comment-13845706
 ] 

Sandy Ryza commented on YARN-1496:
--

Here's a patch with a sketch of the protocol changes.  I still need to add some 
tests and doc.

 Protocol additions to allow moving apps between queues
 --

 Key: YARN-1496
 URL: https://issues.apache.org/jira/browse/YARN-1496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-1496.patch






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1498) RM changes for moving apps between queues

2013-12-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1498:


 Summary: RM changes for moving apps between queues
 Key: YARN-1498
 URL: https://issues.apache.org/jira/browse/YARN-1498
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1499) Fair Scheduler changes for moving apps between queues

2013-12-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1499:


 Summary: Fair Scheduler changes for moving apps between queues
 Key: YARN-1499
 URL: https://issues.apache.org/jira/browse/YARN-1499
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Sandy Ryza
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1497) Expose moving apps between queues on the command line

2013-12-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1497:


 Summary: Expose moving apps between queues on the command line
 Key: YARN-1497
 URL: https://issues.apache.org/jira/browse/YARN-1497
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1481) Move internal services logic from AdminService to ResourceManager

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845788#comment-13845788
 ] 

Vinod Kumar Vavilapalli commented on YARN-1481:
---

bq. One minor nit: AdminService#isRMActive() need not be synchronized. I am 
okay with addressing the nit in another HA JIRA - may be, YARN-1029.
Sorry missed it yesterday. Sure, let's do it in one of the other JIRAs.

 Move internal services logic from AdminService to ResourceManager
 -

 Key: YARN-1481
 URL: https://issues.apache.org/jira/browse/YARN-1481
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 2.4.0

 Attachments: YARN-1481-20131207.txt, YARN-1481-20131209.txt


 This is something I found while reviewing YARN-1318, but didn't halt that 
 patch as many cycles went there already. Some top level issues
  - Not easy to follow RM's service life cycle
 -- RM adds only AdminService as its service directly.
 -- Other services are added to RM when AdminService's init calls 
 RM.activeServices.init()
  - Overall, AdminService shouldn't encompass all of RM's HA state management. 
 It was originally supposed to be the implementation of just the RPC server.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845858#comment-13845858
 ] 

Vinod Kumar Vavilapalli commented on YARN-1029:
---

I think so too. Typical installations will need YARN-1177 more than this 
option? So sequence that first?

Re this patch, seems like embedding ZKFC is beneficial.

bq. However, automatic failover fails to take over after an explicit manual 
failover. To address this RMActiveStandbyElector should implement ZKFCProtocol 
and RMHAServiceTarget#getZKFCProxy should return a proxy to this.
bq. In addition to ActiveStandbyElector, ZKFC has other overheads - health 
monitoring, fencing etc. which might not be required in a simple embedded 
option.

I see that ZKFC = health-monitoring + leader election + graceful failover 
(ZKFCProtocol). Seems like for the embedded case, we want to use leader-election 
+ fencing. To that end, maybe we should refactor ZKFC itself for reuse?

bq. ZKFC communicates to the RM through RPC; when embedded, both are in the 
same process.
We've done similar local RPC short-circuits for token renewal. That should fix 
it?

bq. ZKFC#formatZK() needs to be exposed through rmadmin, which complicates it 
further.
If I understand it correctly, it can be implemented as a standalone command 
instead of a RMAdmin call. Right?

 Allow embedding leader election into the RM
 ---

 Key: YARN-1029
 URL: https://issues.apache.org/jira/browse/YARN-1029
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Karthik Kambatla
 Attachments: yarn-1029-approach.patch


 It should be possible to embed common ActiveStandyElector into the RM such 
 that ZooKeeper based leader election and notification is in-built. In 
 conjunction with a ZK state store, this configuration will be a simple 
 deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1493) Schedulers don't recognize apps separately from app-attempts

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1493:
--

Description: 
Today, scheduler is tied to attempt only.

We need to separate app-level handling logic in scheduler. We can add new 
app-level events to the scheduler and separate the app-level logic out. This is 
good for work-preserving AM restart, RM restart, and also needed for 
differentiating app-level metrics and attempt-level metrics.

  was:Today, scheduler is tied to attempt only. We can add new app-level events 
to the scheduler and separate the app-level logic out. This is good for 
work-preserving AM restart, RM restart, and also needed for differentiating 
app-level metrics and attempt-level metrics.

Summary: Schedulers don't recognize apps separately from app-attempts  
(was: Separate app-level handling logic in scheduler )

 Schedulers don't recognize apps separately from app-attempts
 

 Key: YARN-1493
 URL: https://issues.apache.org/jira/browse/YARN-1493
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He

 Today, scheduler is tied to attempt only.
 We need to separate app-level handling logic in scheduler. We can add new 
 app-level events to the scheduler and separate the app-level logic out. This 
 is good for work-preserving AM restart, RM restart, and also needed for 
 differentiating app-level metrics and attempt-level metrics.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1311) Fix app specific scheduler-events' names to be app-attempt based

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1311:
--

Attachment: YARN-1311-20131211.txt

Patch updated to trunk.

 Fix app specific scheduler-events' names to be app-attempt based
 

 Key: YARN-1311
 URL: https://issues.apache.org/jira/browse/YARN-1311
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Trivial
 Attachments: YARN-1311-20131015.txt, YARN-1311-20131211.txt


 Today, APP_ADDED and APP_REMOVED are sent to the scheduler. They are 
 misnomers as schedulers only deal with AppAttempts today. This JIRA is for 
 fixing their names so that we can add App-level events in the near future, 
 notably for work-preserving RM-restart.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1311) Fix app specific scheduler-events' names to be app-attempt based

2013-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845951#comment-13845951
 ] 

Hadoop QA commented on YARN-1311:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12618329/YARN-1311-20131211.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 8 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2647//console

This message is automatically generated.

 Fix app specific scheduler-events' names to be app-attempt based
 

 Key: YARN-1311
 URL: https://issues.apache.org/jira/browse/YARN-1311
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Trivial
 Attachments: YARN-1311-20131015.txt, YARN-1311-20131211.txt


 Today, APP_ADDED and APP_REMOVED are sent to the scheduler. They are 
 misnomers as schedulers only deal with AppAttempts today. This JIRA is for 
 fixing their names so that we can add App-level events in the near future, 
 notably for work-preserving RM-restart.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845953#comment-13845953
 ] 

Vinod Kumar Vavilapalli commented on YARN-1404:
---

I just caught up with YARN-1197. Seems like some part of that solution is very 
relevant to this JIRA. For example:
bq. Some daemon-based applications may want to start exactly one daemon in 
allocated node (like OpenMPI), such daemon will launch/monitoring workers (like 
MPI processes) itself. We can first allocate some containers for daemons, and 
adjust their size as application’s requirement. This will make YARN support 
two-staged scheduling. Described in YARN-1197

 Enable external systems/frameworks to share resources with Hadoop leveraging 
 Yarn resource scheduling
 -

 Key: YARN-1404
 URL: https://issues.apache.org/jira/browse/YARN-1404
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-1404.patch


 Currently Hadoop Yarn expects to manage the lifecycle of the processes its 
 applications run workload in. External frameworks/systems could benefit from 
 sharing resources with other Yarn applications while running their workload 
 within long-running processes owned by the external framework (in other 
 words, running their workload outside of the context of a Yarn container 
 process). 
 Because Yarn provides robust and scalable resource management, it is 
 desirable for some external systems to leverage the resource governance 
 capabilities of Yarn (queues, capacities, scheduling, access control) while 
 supplying their own resource enforcement.
 Impala is an example of such system. Impala uses Llama 
 (http://cloudera.github.io/llama/) to request resources from Yarn.
 Impala runs an impalad process in every node of the cluster, when a user 
 submits a query, the processing is broken into 'query fragments' which are 
 run in multiple impalad processes leveraging data locality (similar to 
 Map-Reduce Mappers processing a collocated HDFS block of input data).
 The execution of a 'query fragment' requires an amount of CPU and memory in 
 the impalad. As the impalad shares the host with other services (HDFS 
 DataNode, Yarn NodeManager, Hbase Region Server) and Yarn Applications 
 (MapReduce tasks).
 To ensure cluster utilization that follow the Yarn scheduler policies and it 
 does not overload the cluster nodes, before running a 'query fragment' in a 
 node, Impala requests the required amount of CPU and memory from Yarn. Once 
 the requested CPU and memory has been allocated, Impala starts running the 
 'query fragment' taking care that the 'query fragment' does not use more 
 resources than the ones that have been allocated. Memory is book kept per 
 'query fragment' and the threads used for the processing of the 'query 
 fragment' are placed under a cgroup to contain CPU utilization.
 Today, for every resource that has been requested from the Yarn RM, a 
 (container) process must be started via the corresponding NodeManager. Failing 
 to do this results in the cancellation of the container allocation, 
 relinquishing the acquired resource capacity back to the pool of available 
 resources. To avoid this, Impala starts a dummy container process that just 
 runs 'sleep 10y'.
 Using a dummy container process has its drawbacks:
 * the dummy container process is in a cgroup with a given number of CPU 
 shares that are not used and Impala is re-issuing those CPU shares to another 
 cgroup for the thread running the 'query fragment'. The cgroup CPU 
 enforcement works correctly because of the CPU controller implementation (but 
 the formally specified behavior is actually undefined).
 * Impala may ask for CPU and memory independently of each other. Some requests 
 may be memory only with no CPU, or vice versa. Because a container requires a 
 process, the complete absence of memory or CPU is not possible: even if the 
 dummy process is 'sleep', a minimal amount of memory and CPU is required for 
 the dummy process.
 Because of this it is desirable to be able to have a container without a 
 backing process.
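
As a concrete illustration of the dummy-container workaround described above, the 
sketch below shows roughly what launching such a placeholder container could look 
like from an AM. It is a hedged sketch only, not Llama's or Impala's actual code: 
the class name and error handling are assumptions, and it simply assumes a 
Container already allocated via AMRMClient, then launches 'sleep 10y' in it 
through the standard NMClient API.

{code}
// Hedged sketch of the workaround only, not Llama/Impala code. Assumes 'container'
// was already allocated to this AM via AMRMClient; the launched command does no work,
// it only keeps the allocation alive so the RM does not reclaim the resources.
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.util.Records;

public class DummyContainerLauncher {
  public static void launchPlaceholder(Configuration conf, Container container)
      throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    // The process does nothing; it only satisfies YARN's requirement that a
    // container allocation be backed by a running process.
    ctx.setCommands(Collections.singletonList("sleep 10y"));
    nmClient.startContainer(container, ctx);
  }
}
{code}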



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845959#comment-13845959
 ] 

Vinod Kumar Vavilapalli commented on YARN-1197:
---

I just caught up with this. Well written document, thanks! Some questions:
 - Decreasing resources:
-- Seems like the control flow is asymmetrical for resource decrease. We 
directly go to the node first. Is that intended? On first look, that seems fine 
- decreasing resource usage on a node is akin to killing a container by talking 
to the NM directly.
-- In applications that decrease container resources, will the application 
first instruct its container to reduce resource usage and then inform the 
platform? This matters because, if it doesn't happen that way, the node will 
either forcefully kill the container when monitoring resource usage or change 
its cgroup immediately, causing the container to swap.

Also, I can see that some of the scheduler changes are going to be pretty 
involved. I'd also vote for a branch. A couple of patches already went in and 
I'm not even sure we already got them right and/or if they need more revisions 
as we start making core changes. To avoid branch-rot, we could target a subset, 
say just the resource-increase changes in the branch and do the remaining work 
on trunk after merge.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845982#comment-13845982
 ] 

Sandy Ryza commented on YARN-1197:
--

bq. Seems like the control flow is asymmetrical for resource decrease. We 
directly go to the node first. Is that intended? On first look, that seems fine 
- decreasing resource usage on a node is akin to killing a container by talking 
to the NM directly.
This is intentional - we went through a few different flows before settling on 
this approach.  The analogy with killing the container was one of the reasons 
for this.

bq. In applications that decrease container resources, will the application 
first instruct its container to reduce resource usage and then inform the 
platform? This matters because, if it doesn't happen that way, the node will 
either forcefully kill the container when monitoring resource usage or change 
its cgroup immediately, causing the container to swap.
When reducing memory, the application should inform the container process 
before informing the NodeManager.  When only reducing CPU, there will probably 
be situations where only informing the platform is necessary.
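
To make that ordering concrete, here is a minimal sketch. The AppWorkerProtocol 
and ResizeCapableNMClient types below are hypothetical placeholders (no such 
resize API exists in YARN yet), so this only illustrates the sequencing, not 
real client code.

{code}
// Hedged sketch: AppWorkerProtocol and ResizeCapableNMClient are hypothetical
// placeholders, not existing YARN APIs. The point is the ordering described above:
// shrink real usage inside the container first, then tell the platform, so the NM
// never observes usage above the new limit (and never kills or swaps the container).
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Resource;

interface AppWorkerProtocol {            // app-internal channel to the container process
  void releaseMemoryTo(int megabytes);   // blocks until usage is below the new limit
}

interface ResizeCapableNMClient {        // stand-in for a future NM resize API
  void decreaseContainerResource(Container container, Resource newCapability);
}

class DecreaseFlowSketch {
  void decreaseMemory(AppWorkerProtocol worker, ResizeCapableNMClient nm,
                      Container container, Resource newCapability) {
    worker.releaseMemoryTo(newCapability.getMemory());       // 1. reduce real usage first
    nm.decreaseContainerResource(container, newCapability);  // 2. then inform the NM
  }
}
{code}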

bq. To avoid branch-rot, we could target a subset, say just the 
resource-increase changes in the branch and do the remaining work on trunk 
after merge.
Sounds reasonable to me.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845988#comment-13845988
 ] 

Wangda Tan commented on YARN-1197:
--

Hi Vinod,
Thanks for jumping in; my thoughts on your questions:

{quote}
Seems like the control flow is asymmetrical for resource decrease. We directly 
go to the node first. Is that intended? On first look, that seems fine - 
decreasing resource usage on a node is akin to killing a container by talking 
to the NM directly.
{quote}
Yes, we discussed this in this Jira (credit to Bikas, Sandy and Tucu). I think 
decreasing resources is a similar operation to killing a container.

{quote}
In applications that decrease container resources, will the application first 
instruct its container to reduce resource usage and then inform the platform? 
This matters because, if it doesn't happen that way, the node will either 
forcefully kill the container when monitoring resource usage or change its 
cgroup immediately, causing the container to swap.
{quote}
Yes, I think the AM should notify the NM only after making sure resource usage 
has already been reduced in the container, to avoid the container being killed by the NM.

I support moving this to a branch so it can be nicely completed before merging 
it to trunk.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845990#comment-13845990
 ] 

Wangda Tan commented on YARN-1197:
--

Sorry I missed the reply from Sandy :-p

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1500) The num of active/pending apps in fair scheduler app queue is wrong

2013-12-11 Thread Siqi Li (JIRA)
Siqi Li created YARN-1500:
-

 Summary: The num of active/pending apps in fair scheduler app 
queue is wrong
 Key: YARN-1500
 URL: https://issues.apache.org/jira/browse/YARN-1500
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.4-alpha
Reporter: Siqi Li
Assignee: Siqi Li
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1500) The num of active/pending apps in fair scheduler app queue is wrong

2013-12-11 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-1500:
--

Attachment: 4E7261C9-0FD4-40BA-93F3-4CB3D538EBAE.png
B55C71D0-3BD2-4BE1-8433-1C59FE21B110.png

 The num of active/pending apps in fair scheduler app queue is wrong
 ---

 Key: YARN-1500
 URL: https://issues.apache.org/jira/browse/YARN-1500
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.4-alpha
Reporter: Siqi Li
Assignee: Siqi Li
Priority: Minor
 Attachments: 4E7261C9-0FD4-40BA-93F3-4CB3D538EBAE.png, 
 B55C71D0-3BD2-4BE1-8433-1C59FE21B110.png






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container

2013-12-11 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846001#comment-13846001
 ] 

Henry Robinson commented on YARN-1488:
--

Arun, thanks for filing this. I have a few questions so that I can understand 
the proposal more concretely (I'm viewing this in the context of YARN-1404; 
i.e. a server process that wants to obtain resources for workers that may not 
have a 1-to-1 relationship with processes):

Would the recipient and delegated containers have to match the queues to which 
their original resources were granted? If the target is the server process, and 
the source is a set of resources granted for a single worker, the queues would 
likely be different.

* If the queues do not have to match, then presumably the target container is 
the server's. Would the server process now have to track all resources 
delegated to it? I'm thinking that pre-empting a delegated container would 
require the server process to ensure that it no longer assigns resources to the 
relevant worker. This does not affect YARN's correctness, since the resources 
will be revoked no matter what, but otherwise increases the likelihood that 
YARN's understanding of the resource map of the cluster is inaccurate, which is 
not good for anyone. 

* If the queues *must* match, then presumably the target container is supposed 
to be for the worker, not the server (because server and worker occupy 
different queues). But if that's the case, the target container must already 
exist, which makes it seem like YARN would require a process to be running in 
it. That gets us back to where we started with our requirements for Impala: 
we'd like to have one container per query, but we use that container by 
assigning threads to it dynamically rather than whole processes.

I like the delegation idea as a mechanism to coalesce resource requests into 
one. But I don't yet understand how this allows us to maintain one 
cgroup-per-worker without a dedicated worker process, unless the server process 
creates a hierarchy of cgroups underneath it, one for each worker, and 
physically delegates resources that way. This (or an alternative approach where 
all workers run inside the same monolithic server cgroup, and the server 
schedules them in user-land itself) places some pretty hefty requirements on 
the framework to avoid misusing the resources that YARN grants it.
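
To make the "hierarchy of cgroups underneath the server process" alternative a 
bit more concrete, a minimal sketch follows. The cgroup path, share value and 
class name are illustrative assumptions only and not part of any YARN or Impala 
implementation.

{code}
// Hedged illustration only: the cgroup path, share value and class name are
// assumptions, not YARN or Impala code. It sketches creating a child cgroup per
// worker under the server container's cgroup and capping its CPU shares.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkerCgroupSketch {
  // Assumed mount point of the server container's CPU cgroup (illustrative path).
  private static final Path SERVER_CGROUP =
      Paths.get("/sys/fs/cgroup/cpu/yarn/container_01_000002");

  /** Create a child cgroup for one worker, cap its CPU shares and attach a thread. */
  public static void placeWorker(String workerId, long tid, int cpuShares)
      throws Exception {
    Path worker = SERVER_CGROUP.resolve(workerId);
    Files.createDirectories(worker);
    Files.write(worker.resolve("cpu.shares"),
        Integer.toString(cpuShares).getBytes(StandardCharsets.UTF_8));
    // Writing a thread id into "tasks" makes the kernel enforce the limit for it.
    Files.write(worker.resolve("tasks"),
        Long.toString(tid).getBytes(StandardCharsets.UTF_8));
  }
}
{code}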

 Allow containers to delegate resources to another container
 ---

 Key: YARN-1488
 URL: https://issues.apache.org/jira/browse/YARN-1488
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy

 We should allow containers to delegate resources to another container. This 
 would allow external frameworks to share not just YARN's resource-management 
 capabilities but also its workload-management capabilities.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (YARN-1311) Fix app specific scheduler-events' names to be app-attempt based

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1311:
--

Attachment: YARN-1311-20131211.1.txt

 Fix app specific scheduler-events' names to be app-attempt based
 

 Key: YARN-1311
 URL: https://issues.apache.org/jira/browse/YARN-1311
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Trivial
 Attachments: YARN-1311-20131015.txt, YARN-1311-20131211.1.txt, 
 YARN-1311-20131211.txt


 Today, APP_ADDED and APP_REMOVED are sent to the scheduler. They are 
 misnomers as schedulers only deal with AppAttempts today. This JIRA is for 
 fixing their names so that we can add App-level events in the near future, 
 notably for work-preserving RM-restart.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1311) Fix app specific scheduler-events' names to be app-attempt based

2013-12-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846020#comment-13846020
 ] 

Hadoop QA commented on YARN-1311:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12618344/YARN-1311-20131211.1.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 8 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2648//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2648//console

This message is automatically generated.

 Fix app specific scheduler-events' names to be app-attempt based
 

 Key: YARN-1311
 URL: https://issues.apache.org/jira/browse/YARN-1311
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Trivial
 Attachments: YARN-1311-20131015.txt, YARN-1311-20131211.1.txt, 
 YARN-1311-20131211.txt


 Today, APP_ADDED and APP_REMOVED are sent to the scheduler. They are 
 misnomers as schedulers only deal with AppAttempts today. This JIRA is for 
 fixing their names so that we can add App-level events in the near future, 
 notably for work-preserving RM-restart.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846027#comment-13846027
 ] 

Vinod Kumar Vavilapalli commented on YARN-1488:
---

I commented about this on  YARN-1404 too. It's likely that YARN-1488 (this 
JIRA) and/or YARN-1404 have to work in unison with YARN-1197.

 Allow containers to delegate resources to another container
 ---

 Key: YARN-1488
 URL: https://issues.apache.org/jira/browse/YARN-1488
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy

 We should allow containers to delegate resources to another container. This 
 would allow external frameworks to share not just YARN's resource-management 
 capabilities but also its workload-management capabilities.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (YARN-1501) Fair Scheduler will NPE if it hits IOException on queue assignment

2013-12-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1501:


 Summary: Fair Scheduler will NPE if it hits IOException on queue 
assignment
 Key: YARN-1501
 URL: https://issues.apache.org/jira/browse/YARN-1501
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
Reporter: Sandy Ryza


{code}
try {
  QueuePlacementPolicy placementPolicy = allocConf.getPlacementPolicy();
  queueName = placementPolicy.assignAppToQueue(queueName, user);
  if (queueName == null) {
return null;
  }
  queue = queueMgr.getLeafQueue(queueName, true);
} catch (IOException ex) {
  LOG.error("Error assigning app to queue, rejecting", ex);
}

if (rmApp != null) {
  rmApp.setQueue(queue.getName());
} else {
  LOG.warn("Couldn't find RM app to set queue name on");
}
{code}
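
One possible shape of a guard, sketched against the snippet above (whether the 
eventual patch looks like this is an open question): return early when queue 
assignment fails instead of dereferencing the null queue.

{code}
try {
  QueuePlacementPolicy placementPolicy = allocConf.getPlacementPolicy();
  queueName = placementPolicy.assignAppToQueue(queueName, user);
  if (queueName == null) {
    return null;
  }
  queue = queueMgr.getLeafQueue(queueName, true);
} catch (IOException ex) {
  LOG.error("Error assigning app to queue, rejecting", ex);
  return null;  // sketch only: reject the app here so 'queue' is never dereferenced as null
}

if (rmApp != null) {
  rmApp.setQueue(queue.getName());
} else {
  LOG.warn("Couldn't find RM app to set queue name on");
}
{code}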



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1391) Lost node list should be identified by NodeId

2013-12-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846048#comment-13846048
 ] 

Junping Du commented on YARN-1391:
--

bq. in case of multiple node managers on a single machine.
Can you provide more details on real scenarios for multiple NMs on a single 
machine (besides test purposes)?

 Lost node list should be identified by NodeId
 ---

 Key: YARN-1391
 URL: https://issues.apache.org/jira/browse/YARN-1391
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-1391.v1.patch


 In the case of multiple node managers on a single machine, each of them should 
 be identified by NodeId, which is more unique than just the host name.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846064#comment-13846064
 ] 

Sandy Ryza commented on YARN-1197:
--

Wanted to post some thoughts on this vs. YARN-1488.  YARN-1488 proposes that if 
you receive a container on a node you should be able to delegate it to another 
container already running on that node, essentially adding the resources of the 
container you received to the allocation of the running container.  This sounds 
a lot like a resource increase.  The differences are that:
* With the mechanism proposed in this JIRA, the request is explicitly an 
increase and mentions the container you want to add it to.   This allows the 
scheduler to use special logic for handling increase requests.
* With the mechanism proposed on YARN-1488, a container can be used to increase 
the resources of a container from another application.
* With the mechanism proposed here, after satisfying the increase request, the 
scheduler is tracking a single larger container, not multiple small ones.

An advantage of treating an increase request the same as a regular container 
request is that an application could submit it at the same time.  I.e. if I 
want a single container with as many resources as possible on node X, I can 
request a number of containers on that node, wait for some time period for 
allocations to accrue, and then run them all as a single container.

I think the deciding factor might be how preemption functions here.  I.e. what 
is the preemptable unit - can we preempt a part of a container?  Have some 
thoughts but will try to organize them more before posting here.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

2013-12-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846072#comment-13846072
 ] 

Vinod Kumar Vavilapalli commented on YARN-1489:
---

bq. Would be good to see an overall design document..
Yup, writing something up..

 [Umbrella] Work-preserving ApplicationMaster restart
 

 Key: YARN-1489
 URL: https://issues.apache.org/jira/browse/YARN-1489
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 Today if AMs go down,
  - RM kills all the containers of that ApplicationAttempt
  - New ApplicationAttempt doesn't know where the previous containers are 
 running
  - Old running containers don't know where the new AM is running.
 We need to fix this to enable work-preserving AM restart. The latter two can 
 potentially be done at the app level, but it is good to have a common 
 solution for all apps wherever possible.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846075#comment-13846075
 ] 

Bikas Saha commented on YARN-1197:
--

I am afraid we seem to be mixing things up here. This jira deals with the issue 
of increasing and decreasing the resources of an allocated container. There are 
clear use cases for it, as mentioned in previous comments on this jira, e.g. 
having a long running worker daemon increase and decrease its resources 
depending on load. We have already discussed at length on this jira on how 
increasing a container resource is internally no different than requesting 
an additional container and merging it with an existing container. However the 
new container followed by merge is way more complicated for the user and adds 
additional complexity to the system (eg how to deal with the new container that 
was merged into the old one). This complexity is in addition to the common work 
with simply increasing resources on a given container. Wrt the user, asking for 
a container and being able to increase its resources will give the same effect 
as asking for many containers and merging them.
The scenario for YARN-1488 is logically different. That covers the case when an 
app wants to use a shared service and purchases that service by transferring 
its own container resource to that shared service that itself is running inside 
YARN. The consumer app may never need to increase its own container resource. 
Secondly, the shared service is not requesting an increase in its own container 
resources. So this jira does not come into the picture at all.

I believe we have a clear and cleanly separated piece of useful functionality 
being implemented in this jira. We should go ahead and bring this work to 
completion and facilitate the creation of long running services in YARN.

wrt doing this in a branch. There are new API's being added here for which 
functionality does not exist or is not supported yet. And none of that code 
will get executed until clients actually support doing it or someone writes 
code against it. So I don't think that any of this is going to destabilize the 
code base. I agree that the scheduler changes are going to be complicated. We 
can do them in the end when all the plumbing is in place and they could be 
separate jiras for each scheduler. Of course, schedulers would want their own 
flags to turn this on/off. So it's not clear to me what benefits a branch would 
bring here, but it would entail the overhead of maintenance and a lack of test 
automation. Does this mean that every feature addition to YARN needs to be done 
in a branch? I propose we do this work in trunk and later merge it into 
branch-2 when we are satisfied with its stability.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2013-12-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846084#comment-13846084
 ] 

Wangda Tan commented on YARN-1197:
--

My thoughts after reading Bikas and Sandy's comments,
I think the two Jiras are tackling different problems, and as Sandy said, 
container increase/decrease gives the scheduler more input to optimize such 
increase requests (for example, a user can configure increase requests with a 
higher priority than normal container requests, so that increases are handled 
faster to support low-latency services). 
And if a user wants to use container delegation instead of container increase 
just to make resources more dynamic (not shared between users/applications), 
I have some concerns beyond Sandy's points.

* If we treat all delegated resources as the same container, why not simply 
merge them (like what I originally proposed in this Jira)? Merging them will 
make it easier for us to manage preemption, etc.
* How do we deal with merging a running container (a container is already 
running; what should happen if we delegate its resources to another container)? 
If we can only delegate a container in the ACQUIRED state, we need to deal with 
the timeout for launching a container before it is taken back by the RM.
* Can we do container de-delegation?

In general, I support using container increase/decrease to deal with resource 
changes within a single application.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: mapreduce-project.patch.ver.1, 
 tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, 
 yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
 yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
 yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
 yarn-server-resourcemanager.patch.ver.1


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-12-11 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846105#comment-13846105
 ] 

Bikas Saha commented on YARN-674:
-

the only nit I have is that sending the recover/start event to the app is now 
obscured inside the delegation token renewer.

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Fix For: 2.4.0

 Attachments: YARN-674.1.patch, YARN-674.10.patch, YARN-674.2.patch, 
 YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, 
 YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch


 This was caused by YARN-280. A slow or down NameNode will make it look 
 like the RM is unavailable, as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)