[jira] [Updated] (YARN-4053) Change the way metric values are stored in HBase Storage

2015-08-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-4053:
---
Attachment: YARN-4053-YARN-2928.01.patch

 Change the way metric values are stored in HBase Storage
 

 Key: YARN-4053
 URL: https://issues.apache.org/jira/browse/YARN-4053
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-4053-YARN-2928.01.patch


 Currently the HBase implementation uses GenericObjectMapper to convert and store 
 values in the backend HBase storage. This converts everything into a string 
 representation (ASCII/UTF-8 encoded byte array).
 While this is fine in most cases, it does not quite serve our use case for 
 metrics. 
 So we need to decide how we are going to encode and decode metric values and 
 store them in HBase.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.

2015-08-16 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3224:
--
Attachment: 0001-YARN-3224.patch

Attaching a work-in-progress patch. As this patch is directly dependent on 
YARN-3212, I will rebase it once that gets committed.

Also, the current preemption framework does not have support to inform the AM 
about the timeout of a to-be-preempted container. Once YARN-3784 gets in, we can 
leverage it here. 

 Notify AM with containers (on decommissioning node) could be preempted after 
 timeout.
 -

 Key: YARN-3224
 URL: https://issues.apache.org/jira/browse/YARN-3224
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Sunil G
 Attachments: 0001-YARN-3224.patch


 We should leverage YARN preemption framework to notify AM that some 
 containers will be preempted after a timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4039) New AM instances waste resource by waiting only for resource availability when all available resources are already used

2015-08-16 Thread Sadayuki Furuhashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sadayuki Furuhashi updated YARN-4039:
-
Assignee: Tsuyoshi Ozawa  (was: Sadayuki Furuhashi)

 New AM instances waste resource by waiting only for resource availability 
 when all available resources are already used
 ---

 Key: YARN-4039
 URL: https://issues.apache.org/jira/browse/YARN-4039
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0
Reporter: Sadayuki Furuhashi
Assignee: Tsuyoshi Ozawa
 Attachments: YARN-4039.1.patch, YARN-4039.2.patch


 Problem:
 In FairScheduler, maxRunningApps doesn't work well if we can't predict the size 
 of an application in a queue: a small maxRunningApps can't use all the 
 resources when many small applications are issued, while a large maxRunningApps 
 wastes resources when large applications run.
 Background:
 We're using FairScheduler. In the following scenario, AM instances waste 
 resources significantly:
 * A queue has X MB of capacity.
 * An application requests 32 containers where each container requires (X / 32) 
 MB of memory.
 ** In this case, a single application occupies the entire resources of the queue.
 * Many such applications are issued (say, 10 applications).
 * The ideal behavior is that applications run one by one to maximize throughput.
 * However, all applications run simultaneously. As a result, AM instances 
 occupy resources and prevent other tasks from starting. In the worst case, most 
 of the resources are occupied by waiting AMs and applications progress very 
 slowly.
 A solution is to set maxRunningApps to 1 or 2. However, this doesn't work well 
 if the following workload exists in the same queue:
 * An application requests 2 containers where each container requires (X / 32) MB 
 of memory.
 * Many such applications are issued (say, 10 applications).
 * The ideal behavior is that all applications run simultaneously to maximize 
 concurrency and throughput.
 * However, the number of applications is limited by maxRunningApps. In the worst 
 case, most of the resources are idle.
 This problem happens especially with Hive because we can't estimate the size of a 
 MapReduce application.
 Solution:
 An AM doesn't have to start if there are already waiting resource requests, 
 because its own resource requests can't be granted even if it starts.
 Patch:
 I attached a patch that implements this behavior. But this implementation has 
 these trade-offs:
 * When an AM is registered with FairScheduler, its demand is 0 because even the 
 AM attempt is not created yet. Starting this AM doesn't change the resource 
 demand of the queue. So, if many AMs are submitted to a queue at the same time, 
 all AMs will be RUNNING. But we want to prevent that.
 * When an AM starts, the demand of the application is only the AM attempt. The 
 AM then requests more resources. Until the AM requests resources, the demand of 
 the queue is low, but starting AMs during this time will start unnecessary AMs. 
 * So, this patch doesn't start an AM immediately when it is registered. Instead, 
 it starts AMs only every continuous-scheduling-sleep-ms.
 * Setting a large continuous-scheduling-sleep-ms will prevent wasting AMs, but 
 it increases latency.
 Therefore, this patch is enabled only if the new option 
 demand-blocks-am-enabled is true.
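
 For illustration, a minimal, self-contained sketch of the proposed gating rule. 
 The class, method, and parameter names below are invented for this example and 
 are not part of the attached patch.
{code}
// Hypothetical sketch of the gating rule described above: a new AM is started
// only if the demand of already-running applications leaves room in the queue.
// All names and numbers are illustrative; this is not FairScheduler code.
public final class AmStartGate {

  static boolean shouldStartNewAm(long queueCapacityMb,
                                  long demandOfRunningAppsMb,
                                  long amResourceMb) {
    // If running apps already demand the whole queue, a new AM would only
    // occupy memory while waiting for containers, so defer starting it.
    return demandOfRunningAppsMb + amResourceMb <= queueCapacityMb;
  }

  public static void main(String[] args) {
    // Queue of 32 GB, running apps already demand 32 GB, AM needs 1 GB: defer.
    System.out.println(shouldStartNewAm(32768, 32768, 1024)); // false
    // Running apps demand only 2 GB: plenty of room, start the AM.
    System.out.println(shouldStartNewAm(32768, 2048, 1024));  // true
  }
}
{code}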



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.

2015-08-16 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3224:
--
Release Note:   (was: We should leverage YARN preemption framework to 
notify AM that some containers will be preempted after a timeout.)

 Notify AM with containers (on decommissioning node) could be preempted after 
 timeout.
 -

 Key: YARN-3224
 URL: https://issues.apache.org/jira/browse/YARN-3224
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Sunil G





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.

2015-08-16 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3224:
--
Description: We should leverage YARN preemption framework to notify AM that 
some containers will be preempted after a timeout.

 Notify AM with containers (on decommissioning node) could be preempted after 
 timeout.
 -

 Key: YARN-3224
 URL: https://issues.apache.org/jira/browse/YARN-3224
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Sunil G

 We should leverage YARN preemption framework to notify AM that some 
 containers will be preempted after a timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4053) Change the way metric values are stored in HBase Storage

2015-08-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698602#comment-14698602
 ] 

Varun Saxena commented on YARN-4053:


This patch demonstrates the approach mentioned above and works for both 
integral and floating-point values. For floating-point values, however, the 
restriction on the client side is that it should always send values in decimal 
format; otherwise, when I add metric filters, matching will fail. I guess 
it's a fair enough restriction to place.

In the patch, we can indicate that numerical values have to be stored per 
column/column prefix.
We could, however, extend this logic to all values and indicate whether the 
values to be stored are ASCII encoded as well, so that different kinds of values 
can be stored differently in the same column.

But there is no use case for this as of now, so I haven't done so. 

I will remove the part about floating-point numbers from the patch if we don't 
want it now.
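
For illustration, a minimal sketch of storing metric values as native numeric 
byte encodings instead of UTF-8 strings, assuming the HBase {{Bytes}} utility. 
This is not the attached patch; the class name is invented for the example.
{code}
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative codec: integral metric values are written as 8-byte longs and
// floating-point values as 8-byte doubles, instead of UTF-8 string encodings.
public final class MetricValueCodec {

  static byte[] encode(Number value) {
    if (value instanceof Float || value instanceof Double) {
      return Bytes.toBytes(value.doubleValue());
    }
    return Bytes.toBytes(value.longValue());
  }

  static long decodeAsLong(byte[] bytes) {
    return Bytes.toLong(bytes);
  }

  static double decodeAsDouble(byte[] bytes) {
    return Bytes.toDouble(bytes);
  }

  public static void main(String[] args) {
    System.out.println(decodeAsLong(encode(42L)));      // 42
    System.out.println(decodeAsDouble(encode(98.6d)));  // 98.6
  }
}
{code}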



 Change the way metric values are stored in HBase Storage
 

 Key: YARN-4053
 URL: https://issues.apache.org/jira/browse/YARN-4053
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-4053-YARN-2928.01.patch


 Currently the HBase implementation uses GenericObjectMapper to convert and store 
 values in the backend HBase storage. This converts everything into a string 
 representation (ASCII/UTF-8 encoded byte array).
 While this is fine in most cases, it does not quite serve our use case for 
 metrics. 
 So we need to decide how we are going to encode and decode metric values and 
 store them in HBase.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4053) Change the way metric values are stored in HBase Storage

2015-08-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698608#comment-14698608
 ] 

Hadoop QA commented on YARN-4053:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m  3s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 54s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 16s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 26s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 40s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 52s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 22s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  38m 58s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750693/YARN-4053-YARN-2928.01.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / f40c735 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8853/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8853/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8853/console |


This message was automatically generated.

 Change the way metric values are stored in HBase Storage
 

 Key: YARN-4053
 URL: https://issues.apache.org/jira/browse/YARN-4053
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Varun Saxena
Assignee: Varun Saxena
 Attachments: YARN-4053-YARN-2928.01.patch


 Currently the HBase implementation uses GenericObjectMapper to convert and store 
 values in the backend HBase storage. This converts everything into a string 
 representation (ASCII/UTF-8 encoded byte array).
 While this is fine in most cases, it does not quite serve our use case for 
 metrics. 
 So we need to decide how we are going to encode and decode metric values and 
 store them in HBase.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Dan Shechter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Shechter updated YARN-3997:
---
Description: 
When our cluster is configured with preemption, and is fully loaded with an 
application consuming 1-core containers, it will not kill off these containers 
when a new application kicks in requesting containers with a size  1, for 
example 4 core containers.

When the second application attempts to us 1-core containers as well, 
preemption proceeds as planned and everything works properly.

It is my assumption, that the fair-scheduler, while recognizing it needs to 
kill off some container to make room for the new application, fails to find a 
SINGLE container satisfying the request for a 4-core container (since all 
existing containers are 1-core containers), and isn't smart enough to realize 
it needs to kill off 4 single-core containers (in this case) on a single node, 
for the new application to be able to proceed...

The exhibited affect is that the new application is hung indefinitely and never 
gets the resources it requires.

This can easily be replicated with any yarn application.
Our goto scenario in this case is running pyspark with 1-core executors 
(containers) while trying to launch h20.ai framework which INSISTS on having at 
least 4 cores per container.

  was:
When our cluster is configures with preemption, and is fully loaded with an 
application consuming 1-core containers, it will not kill off these containers 
when a new application kicks in requesting, for example 4 core containers.

When the second application attempts to us 1-core containers as well, 
preemption proceeds as planned and everything works properly.

It is my assumptiom, that the fair-scheduler, while recognizing it needs to 
kill off some container to make room for the new application, fails to find a 
SINGLE container satisfying the request for a 4-core container (since all 
existing containers are 1-core containers), and isn't smart enough to realize 
it needs to kill off 4 single-core containers (in this case) on a single node, 
for the new application to be able to proceed...

The exhibited affect is that the new application is hung indefinitely and never 
gets the resources it requires.

This can easily be replicated with any yarn application.
Our goto scenario in this case is running pyspark with 1-core executors 
(containers) while trying to launch h20.ai framework which INSISTS on having at 
least 4 cores per container.


 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Karthik Kambatla
Priority: Critical

 When our cluster is configured with preemption, and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application attempts to use 1-core containers as well, 
 preemption proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing it needs to 
 kill off some containers to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which INSISTS on 
 having at least 4 cores per container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3534) Collect memory/cpu usage on the node

2015-08-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698663#comment-14698663
 ] 

Karthik Kambatla commented on YARN-3534:


+1

 Collect memory/cpu usage on the node
 

 Key: YARN-3534
 URL: https://issues.apache.org/jira/browse/YARN-3534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
 YARN-3534-11.patch, YARN-3534-12.patch, YARN-3534-14.patch, 
 YARN-3534-15.patch, YARN-3534-16.patch, YARN-3534-16.patch, 
 YARN-3534-17.patch, YARN-3534-17.patch, YARN-3534-18.patch, 
 YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
 YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
 YARN-3534-9.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 YARN should be aware of the resource utilization of the nodes when scheduling 
 containers. To that end, this task will implement the collection of memory/CPU 
 usage on the node.
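
 As background only, a minimal Linux-only sketch of reading node memory figures 
 from /proc/meminfo. This is illustrative and is not part of the attached 
 patches; the class name is invented for the example.
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative, Linux-only: read a memory figure (in kB) from /proc/meminfo.
public final class ProcMemInfoReader {

  static long readKb(String key) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("/proc/meminfo"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        // Lines look like "MemTotal:       16326428 kB"
        if (line.startsWith(key + ":")) {
          return Long.parseLong(line.split("\\s+")[1]);
        }
      }
    } finally {
      reader.close();
    }
    return -1;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("MemTotal kB: " + readKb("MemTotal"));
    System.out.println("MemFree kB:  " + readKb("MemFree"));
  }
}
{code}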



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Chen Avnery (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698648#comment-14698648
 ] 

Chen Avnery commented on YARN-3997:
---

This has happened to me too and I would love for it to have a fix!

 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Karthik Kambatla
Priority: Critical

 When our cluster is configured with preemption, and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application attempts to use 1-core containers as well, 
 preemption proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing it needs to 
 kill off some containers to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which INSISTS on 
 having at least 4 cores per container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3534) Collect memory/cpu usage on the node

2015-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698674#comment-14698674
 ] 

Hudson commented on YARN-3534:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #286 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/286/])
YARN-3534. Collect memory/cpu usage on the node. (Inigo Goiri via kasha) 
(kasha: rev def12933b38efd5e47c5144b729c1a1496f09229)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeResourceMonitor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeResourceMonitor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeResourceMonitorImpl.java


 Collect memory/cpu usage on the node
 

 Key: YARN-3534
 URL: https://issues.apache.org/jira/browse/YARN-3534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
 YARN-3534-11.patch, YARN-3534-12.patch, YARN-3534-14.patch, 
 YARN-3534-15.patch, YARN-3534-16.patch, YARN-3534-16.patch, 
 YARN-3534-17.patch, YARN-3534-17.patch, YARN-3534-18.patch, 
 YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
 YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
 YARN-3534-9.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 YARN should be aware of the resource utilization of the nodes when scheduling 
 containers. To that end, this task will implement the collection of memory/CPU 
 usage on the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Ilan Assayag (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698709#comment-14698709
 ] 

Ilan Assayag commented on YARN-3997:


Got exactly the same issue too. Very painful...

 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Karthik Kambatla
Priority: Critical

 When our cluster is configured with preemption, and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application attempts to use 1-core containers as well, 
 preemption proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing it needs to 
 kill off some containers to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which INSISTS on 
 having at least 4 cores per container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2467) Add SpanReceiverHost to ResourceManager

2015-08-16 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698730#comment-14698730
 ] 

Masatake Iwasaki commented on YARN-2467:


The test failures do not seem to be related to the patch. 
TestContainerAllocation succeeded in my local environment.

 Add SpanReceiverHost to ResourceManager
 ---

 Key: YARN-2467
 URL: https://issues.apache.org/jira/browse/YARN-2467
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-2467.001.patch, YARN-2467.002.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2467) Add SpanReceiverHost to ResourceManager

2015-08-16 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-2467:
---
Description: A per-process SpanReceiverHost should be initialized in the 
ResourceManager in the same way as the NameNode and DataNode do, in order to 
support tracing.

 Add SpanReceiverHost to ResourceManager
 ---

 Key: YARN-2467
 URL: https://issues.apache.org/jira/browse/YARN-2467
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-2467.001.patch, YARN-2467.002.patch


 A per-process SpanReceiverHost should be initialized in the ResourceManager in 
 the same way as the NameNode and DataNode do, in order to support tracing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3534) Collect memory/cpu usage on the node

2015-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698667#comment-14698667
 ] 

Hudson commented on YARN-3534:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8311 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8311/])
YARN-3534. Collect memory/cpu usage on the node. (Inigo Goiri via kasha) 
(kasha: rev def12933b38efd5e47c5144b729c1a1496f09229)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeResourceMonitorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeResourceMonitor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeResourceMonitor.java


 Collect memory/cpu usage on the node
 

 Key: YARN-3534
 URL: https://issues.apache.org/jira/browse/YARN-3534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
 YARN-3534-11.patch, YARN-3534-12.patch, YARN-3534-14.patch, 
 YARN-3534-15.patch, YARN-3534-16.patch, YARN-3534-16.patch, 
 YARN-3534-17.patch, YARN-3534-17.patch, YARN-3534-18.patch, 
 YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
 YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
 YARN-3534-9.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 YARN should be aware of the resource utilization of the nodes when scheduling 
 containers. To that end, this task will implement the collection of memory/CPU 
 usage on the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Uri Miron (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698734#comment-14698734
 ] 

Uri Miron commented on YARN-3997:
-

I am getting the same issue.

 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Karthik Kambatla
Priority: Critical

 When our cluster is configured with preemption, and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application attempts to use 1-core containers as well, 
 preemption proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing it needs to 
 kill off some containers to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which INSISTS on 
 having at least 4 cores per container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2467) Add SpanReceiverHost to ResourceManager

2015-08-16 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698733#comment-14698733
 ] 

Masatake Iwasaki commented on YARN-2467:


The TraceAdmin added to AdminService is for online tracing configuration updates 
via the {{hadoop trace}} command, as supported by the NameNode and DataNode. The 
{{hadoop trace}} command requires users to specify the host:port string of the 
target IPC server, in order to support use cases in which tracing is enabled on 
a specific slave server only.

Though I added TraceAdminProtocol to AdminService in the same way as 
HAServiceProtocol does, the NodeManager does not have an IPC server for 
administration. I think it is OK to remove the TraceAdmin feature from the 
ResourceManager as a starting point.

 Add SpanReceiverHost to ResourceManager
 ---

 Key: YARN-2467
 URL: https://issues.apache.org/jira/browse/YARN-2467
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-2467.001.patch, YARN-2467.002.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4049) Add SpanReceiverHost to NodeManager

2015-08-16 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-4049:
---
Attachment: YARN-4049.001.patch

I'm attaching a work-in-progress patch. The 001 patch cannot be applied to trunk 
because it depends on YARN-2467. If the patch attached to YARN-2467 is updated 
before being committed, I will fix this one accordingly.

 Add SpanReceiverHost to NodeManager
 ---

 Key: YARN-4049
 URL: https://issues.apache.org/jira/browse/YARN-4049
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-4049.001.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3534) Collect memory/cpu usage on the node

2015-08-16 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698806#comment-14698806
 ] 

Inigo Goiri commented on YARN-3534:
---

Thank you [~kasha] for taking care of the review and the commits.
I'll move on to propagating this info to the scheduler.

 Collect memory/cpu usage on the node
 

 Key: YARN-3534
 URL: https://issues.apache.org/jira/browse/YARN-3534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
 YARN-3534-11.patch, YARN-3534-12.patch, YARN-3534-14.patch, 
 YARN-3534-15.patch, YARN-3534-16.patch, YARN-3534-16.patch, 
 YARN-3534-17.patch, YARN-3534-17.patch, YARN-3534-18.patch, 
 YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
 YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
 YARN-3534-9.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 YARN should be aware of the resource utilization of the nodes when scheduling 
 containers. To that end, this task will implement the collection of memory/CPU 
 usage on the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-08-16 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698927#comment-14698927
 ] 

Hong Zhiguo commented on YARN-4024:
---

That's a good reason to have this cache.
[~leftnoteasy], in earlier comments, you said:
{code}
1) If a host_a, has IP=IP1, IP1 is on whitelist. If we change the IP of host_a 
to IP2, IP2 is in blacklist. We won't do the re-resolve since the cached IP1 is 
on whitelist.
2) If a host_a, has IP=IP1, IP1 is on blacklist. We may need to do re-resolve 
every time when the node doing heartbeat since it may change to its IP to a one 
not on the blacklist.
{code}
I think that's too complicated. The cache lookup is part of resolving (name 
to address), and the check against the IP whitelist/blacklist is just the 
following stage. I think a cache with a configurable expiration is enough; we'd 
better keep the two stages orthogonal rather than mixing them up.
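
For illustration, a minimal sketch of such an expiring resolution cache, assuming 
Guava's {{LoadingCache}}. The class and parameter names are made up for this 
example and are not taken from any patch here.
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Illustrative host-name resolution cache with a configurable expiration:
// DNS is only consulted when an entry is missing or has expired.
public final class ResolvedAddressCache {

  private final LoadingCache<String, String> cache;

  public ResolvedAddressCache(long expireMs) {
    this.cache = CacheBuilder.newBuilder()
        .expireAfterWrite(expireMs, TimeUnit.MILLISECONDS)
        .build(new CacheLoader<String, String>() {
          @Override
          public String load(String hostname) throws UnknownHostException {
            return InetAddress.getByName(hostname).getHostAddress();
          }
        });
  }

  public String resolve(String hostname) {
    // Whitelist/blacklist checks on the returned address would happen as a
    // separate stage after resolution.
    return cache.getUnchecked(hostname);
  }
}
{code}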

BTW, I think it's not good to have Name in NodeId but Address in the 
whitelist/blacklist; different layers of abstraction are mixed up. We wouldn't 
have this issue if either Name or Address were used for both NodeId and the 
whitelist/blacklist.
A better way is to have Name in whitelist/blacklist, instead of Address. 



 YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
 --

 Key: YARN-4024
 URL: https://issues.apache.org/jira/browse/YARN-4024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan
Assignee: Hong Zhiguo

 Currently, the YARN RM NodesListManager will resolve the IP address every time a 
 node sends a heartbeat. When the DNS server becomes slow, NM heartbeats will be 
 blocked and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-08-16 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698929#comment-14698929
 ] 

Hong Zhiguo commented on YARN-4024:
---

Please ignore the last sentence, "A better way is to have Name in 
whitelist/blacklist, instead of Address." Or could someone help to delete it?

 YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
 --

 Key: YARN-4024
 URL: https://issues.apache.org/jira/browse/YARN-4024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan
Assignee: Hong Zhiguo

 Currently, the YARN RM NodesListManager will resolve the IP address every time a 
 node sends a heartbeat. When the DNS server becomes slow, NM heartbeats will be 
 blocked and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Srikanth Kandula (JIRA)
Srikanth Kandula created YARN-4056:
--

 Summary: Bundling: Searching for multiple containers in a single 
pass over {queues, applications, priorities}
 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula


More than one container is allocated on many NM heartbeats. Yet, the current 
scheduler allocates exactly one container per iteration over {queues, 
applications, priorities}. When there are many queues, applications, or 
priorities, allocating only one container per iteration can needlessly increase 
the duration of the NM heartbeat.
 
In this JIRA, we propose bundling: that is, allow arbitrarily many containers 
to be allocated in a single iteration over {queues, applications, 
priorities}.
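
A minimal, self-contained toy illustrating the idea (the request ordering and 
data structures are invented for this example and are not scheduler code):
{code}
import java.util.ArrayList;
import java.util.List;

// Toy bundling sketch: a single pass over pending requests (already ordered by
// queue/application/priority) grants every request that still fits, instead of
// stopping after the first allocation.
public final class BundlingSketch {

  static List<Integer> allocateBundle(int[] pendingCores, int freeCores) {
    List<Integer> grantedIndexes = new ArrayList<Integer>();
    for (int i = 0; i < pendingCores.length; i++) {
      if (pendingCores[i] <= freeCores) {
        grantedIndexes.add(i);
        freeCores -= pendingCores[i];
      }
    }
    return grantedIndexes;
  }

  public static void main(String[] args) {
    // A heartbeat reporting 4 free cores satisfies the 1-core and 2-core
    // requests in one iteration rather than one request per heartbeat.
    System.out.println(allocateBundle(new int[]{1, 2, 4}, 4)); // [0, 1]
  }
}
{code}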



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3901) Populate flow run data in the flow_run table

2015-08-16 Thread Vrushali C (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vrushali C updated YARN-3901:
-
Attachment: YARN-3901-YARN-2928.WIP.patch


Uploading a work-in-progress patch. This patch is not yet rebased against the 
new branch. I would like to finish the patch and then rebase it.

- It has some new classes that deal with the flow run table. 
- It adds cell-level tags to the cells being stored in the flow run table. 
- It has a coprocessor class that currently handles the put (prePut) and scan 
(preScannerOpen and postScannerOpen) operations. 
- It has a new AggregationScanner class that is invoked from the coprocessor, 
so any scans that hit this table effectively go through the AggregationScanner 
class methods.
- The start time for a flow is defined as the lowest amongst the start times of 
all applications in that flow run. Similarly, the end time for a flow is 
defined as the largest amongst the end times of all applications in that flow 
run. These are stored per flow run upon application creation and application 
finish events. The coprocessor prePut intercepts these and puts in only the 
right values. 
- For metrics, all values are stored as they come in. When a metric for a flow 
run is to be read back, a special scanner is used. This scanner reads all cells 
for that metric belonging to all applications. Only the latest cell per 
application is picked, and these are summed up to form the metric value for 
that flow run. The application states (running and finished) are also stored in 
the cell tags for metrics.

TODO: 
- Next, add the get (very similar to scan), flush, and compact operations.
- The applications that have finished can be compacted and a per-flow-run 
metric cell can be created.
- The finished application cells can then be removed.
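
A minimal, self-contained sketch of the read-time metric aggregation described 
above (illustrative only; the cell representation is simplified and this is not 
the WIP patch):
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative aggregation: keep only the latest cell per application for a
// given metric, then sum the kept values to form the flow-run metric value.
public final class FlowRunMetricAggregation {

  // Each cell is {applicationId, timestamp, value}.
  static long aggregate(long[][] cells) {
    Map<Long, long[]> latestPerApp = new HashMap<Long, long[]>();
    for (long[] cell : cells) {
      long[] current = latestPerApp.get(cell[0]);
      if (current == null || cell[1] > current[1]) {
        latestPerApp.put(cell[0], cell);
      }
    }
    long sum = 0;
    for (long[] cell : latestPerApp.values()) {
      sum += cell[2];
    }
    return sum;
  }

  public static void main(String[] args) {
    // App 1 reported 10 then 15; app 2 reported 7. Flow-run value = 15 + 7 = 22.
    System.out.println(aggregate(new long[][]{{1, 100, 10}, {1, 200, 15}, {2, 150, 7}}));
  }
}
{code}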


 Populate flow run data in the flow_run table
 

 Key: YARN-3901
 URL: https://issues.apache.org/jira/browse/YARN-3901
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Vrushali C
Assignee: Vrushali C
 Attachments: YARN-3901-YARN-2928.WIP.patch


 As per the schema proposed in YARN-3815 in 
 https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
 filing this jira to track the creation and population of data in the flow run table. 
 Some points that are being considered:
 - Stores per-flow-run information aggregated across applications, per flow 
 version. The RM's collector writes to it on app creation and app completion.
 - The per-app collector writes to it for metric updates at a slower frequency 
 than the metric updates to the application table. 
 Primary key: cluster ! user ! flow ! flow run id.
 - Only the latest version of flow-level aggregated metrics will be kept, even 
 if the entity and application levels keep a timeseries.
 - The running_apps column will be incremented on app creation, and 
 decremented on app completion.
 - For min_start_time the RM writer will simply write a value with the tag for 
 the applicationId. A coprocessor will return the min value of all written 
 values.
 - Upon flush and compactions, the min value among all the cells of this 
 column will be written to the cell without any tag (empty tag) and all the 
 other cells will be discarded.
 - Ditto for max_end_time, but then the max will be kept.
 - Tags are represented as #type:value. The type can be not set (0), or can 
 indicate running (1) or complete (2). In those cases (for metrics) only 
 complete app metrics are collapsed on compaction.
 - The m! values are aggregated (summed) upon read. Only when applications are 
 completed (indicated by tag type 2) can the values be collapsed.
 - The application ids that have completed and been aggregated into the flow 
 numbers are retained in a separate column for historical tracking: we don't 
 want to re-aggregate for those upon replay.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Srikanth Kandula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698925#comment-14698925
 ] 

Srikanth Kandula commented on YARN-4056:


Will look. Possibly. However, this architecture allows any bundling policy. We 
will push through a couple of different bundling policies. I suspect the 
packer+dependencies+bounded-unfairness bundle will be novel.

 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
Assignee: Robert Grandl
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {{queues, 
 applications, priorities}}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {{queues, applications and 
 priorities}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Srikanth Kandula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Kandula updated YARN-4056:
---
Description: 
More than one container is allocated on many NM heartbeats. Yet, the current 
scheduler allocates exactly one container per iteration over {{queues, 
applications, priorities}}. When there are many queues, applications, or 
priorities allocating only one container per iteration can  needlessly increase 
the duration of the NM heartbeat.
 
In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
to be allocated in a single iteration over {queues, applications and 
priorities}.

  was:
More than one container is allocated on many NM heartbeats. Yet, the current 
scheduler allocates exactly one container per iteration over {queues, 
applications, priorities}. When there are many queues, applications, or 
priorities allocating only one container per iteration can  needlessly increase 
the duration of the NM heartbeat.
 
In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
to be allocated in a single iteration over {queues, applications and 
priorities}.


 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {{queues, 
 applications, priorities}}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {queues, applications and 
 priorities}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Srikanth Kandula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Kandula updated YARN-4056:
---
Attachment: bundling.docx

 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {queues, 
 applications, priorities}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {queues, applications and 
 priorities}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Srikanth Kandula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Kandula updated YARN-4056:
---
Description: 
More than one container is allocated on many NM heartbeats. Yet, the current 
scheduler allocates exactly one container per iteration over {{queues, 
applications, priorities}}. When there are many queues, applications, or 
priorities allocating only one container per iteration can  needlessly increase 
the duration of the NM heartbeat.
 
In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
to be allocated in a single iteration over {{queues, applications and 
priorities}}.

  was:
More than one container is allocated on many NM heartbeats. Yet, the current 
scheduler allocates exactly one container per iteration over {{queues, 
applications, priorities}}. When there are many queues, applications, or 
priorities allocating only one container per iteration can  needlessly increase 
the duration of the NM heartbeat.
 
In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
to be allocated in a single iteration over {queues, applications and 
priorities}.


 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {{queues, 
 applications, priorities}}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {{queues, applications and 
 priorities}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Robert Grandl (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Grandl reassigned YARN-4056:
---

Assignee: Robert Grandl

 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
Assignee: Robert Grandl
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {{queues, 
 applications, priorities}}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {{queues, applications and 
 priorities}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-08-16 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698908#comment-14698908
 ] 

Rohith Sharma K S commented on YARN-3893:
-

Sorry for coming in very late. This issue has become stale; we need to move forward!
Regarding the patch:
# Instead of setting a boolean flag for reinitActiveServices in AdminService and 
the other changes, moving {{createAndInitActiveServices();}} from 
transitionedToStandby to just before starting activeServices would solve such 
issues. And on an exception while transitioning to active, handle it by calling 
stopActiveServices in ResourceManager#transitionToActive() only. 
# With the above approach, we can probably remove refreshAll() from 
AdminService#transitionToActive.

Any thoughts?

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml


 Cases that can cause this:
 # The capacity scheduler xml is wrongly configured during the switch
 # Refresh ACL failure due to configuration
 # Refresh user group failure due to configuration
 Both RMs will then continuously try to become active:
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both web UIs show active
 # Status is shown as active for both RMs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4056) Bundling: Searching for multiple containers in a single pass over {queues, applications, priorities}

2015-08-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698915#comment-14698915
 ] 

Karthik Kambatla commented on YARN-4056:


Is this similar to {{assignMultiple}} in {{FairScheduler}}?

 Bundling: Searching for multiple containers in a single pass over {queues, 
 applications, priorities}
 

 Key: YARN-4056
 URL: https://issues.apache.org/jira/browse/YARN-4056
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Srikanth Kandula
Assignee: Robert Grandl
 Attachments: bundling.docx


 More than one container is allocated on many NM heartbeats. Yet, the current 
 scheduler allocates exactly one container per iteration over {{queues, 
 applications, priorities}}. When there are many queues, applications, or 
 priorities allocating only one container per iteration can  needlessly 
 increase the duration of the NM heartbeat.
  
 In this JIRA, we propose bundling. That is, allow arbitrarily many containers 
 to be allocated in a single iteration over {{queues, applications and 
 priorities}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v1.patch

Fixing broken unit tests.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v1.patch

Whitespace fix.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, 
 YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: (was: YARN-3980-v1.patch)

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698955#comment-14698955
 ] 

Hadoop QA commented on YARN-3980:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 19s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:red}-1{color} | javac |   3m 35s | The patch appears to cause the 
build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750738/YARN-3980-v0.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 13604bd |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8855/console |


This message was automatically generated.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699003#comment-14699003
 ] 

Hadoop QA commented on YARN-3980:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 45s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 4 new or modified test files. |
| {color:red}-1{color} | javac |   8m 20s | The patch appears to cause the 
build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750745/YARN-3980-v1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 13604bd |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8856/console |


This message was automatically generated.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: (was: YARN-3980-v1.patch)

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v1.patch

Fixed SLS.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Dan Shechter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698830#comment-14698830
 ] 

Dan Shechter commented on YARN-3997:


Hi,
I was trying to find the existing unit tests for FairScheduler preemption. 
All I could find was this:
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java

Are there more tests hiding somewhere else?

 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Arun Suresh
Priority: Critical

 When our cluster is configured with preemption and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application requests 1-core containers as well, preemption 
 proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing that it needs to 
 kill off some container to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which insists on 
 having at least 4 cores per container.
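 A minimal standalone illustration of the aggregation being asked for (hypothetical code, not the FairScheduler preemption logic): to free a 4-core slot, select enough co-located 1-core containers on a single node.
{code}
// Hypothetical sketch: pick containers to preempt on one node so that the
// freed cores cover a larger request.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreemptionBundleSketch {

  static class Container {
    final int cores;
    Container(int cores) { this.cores = cores; }
  }

  /** Containers to kill on this node, or an empty list if even killing
   *  everything would not free enough cores. */
  static List<Container> pickVictims(List<Container> running, int neededCores) {
    List<Container> victims = new ArrayList<>();
    int freed = 0;
    for (Container c : running) {
      if (freed >= neededCores) {
        break;
      }
      victims.add(c);           // several 1-core containers add up to 4 cores
      freed += c.cores;
    }
    return freed >= neededCores ? victims : Collections.<Container>emptyList();
  }

  public static void main(String[] args) {
    List<Container> node = Arrays.asList(new Container(1), new Container(1),
                                         new Container(1), new Container(1));
    System.out.println(pickVictims(node, 4).size());  // prints 4
  }
}
{code}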



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-4055:
--
Attachment: YARN-4055-v0.patch

First version for sending the node resource utilization in the heartbeat.

 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.
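 A simplified standalone sketch of the idea (hypothetical types, not the attached patch; real YARN carries this in its NodeStatus/ResourceUtilization records): the node samples its own utilization and attaches it to every heartbeat it sends to the RM.
{code}
// Hypothetical sketch of utilization riding along with the heartbeat.
public class UtilizationHeartbeatSketch {

  /** What a node resource monitor would sample periodically. */
  static class NodeUtilization {
    final long usedMemoryMB;
    final float cpuUsage;          // 0.0 - 1.0, -1 if unknown
    NodeUtilization(long usedMemoryMB, float cpuUsage) {
      this.usedMemoryMB = usedMemoryMB;
      this.cpuUsage = cpuUsage;
    }
  }

  /** What the heartbeat would carry in addition to container statuses. */
  static class NodeStatusLite {
    final String nodeId;
    final NodeUtilization nodeUtilization;
    NodeStatusLite(String nodeId, NodeUtilization nodeUtilization) {
      this.nodeId = nodeId;
      this.nodeUtilization = nodeUtilization;
    }
  }

  static NodeUtilization sample() {
    Runtime rt = Runtime.getRuntime();
    long usedMB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    return new NodeUtilization(usedMB, -1f);   // CPU sampling omitted in the sketch
  }

  public static void main(String[] args) {
    NodeStatusLite status = new NodeStatusLite("node-1:45454", sample());
    // This is what the RM would receive on each heartbeat.
    System.out.println(status.nodeId + " used " + status.nodeUtilization.usedMemoryMB + " MB");
  }
}
{code}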



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698863#comment-14698863
 ] 

Karthik Kambatla commented on YARN-4055:


+1

 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch, YARN-4055-v1.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Inigo Goiri (JIRA)
Inigo Goiri created YARN-4055:
-

 Summary: Report node resource utilization in heartbeat
 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0


Send the resource utilization from the node (obtained in the 
NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-4055:
--
Attachment: YARN-4055-v1.patch

Changing type for node resource monitor in Node Manager.

 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch, YARN-4055-v1.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers

2015-08-16 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-3997:
--

Assignee: Arun Suresh  (was: Karthik Kambatla)

Was discussing this with [~asuresh] offline, and he wanted to take this up. 

 An Application requesting multiple core containers can't preempt running 
 application made of single core containers
 ---

 Key: YARN-3997
 URL: https://issues.apache.org/jira/browse/YARN-3997
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines
Reporter: Dan Shechter
Assignee: Arun Suresh
Priority: Critical

 When our cluster is configured with preemption and is fully loaded with an 
 application consuming 1-core containers, it will not kill off these 
 containers when a new application kicks in requesting containers with a size 
 > 1, for example 4-core containers.
 When the second application requests 1-core containers as well, preemption 
 proceeds as planned and everything works properly.
 My assumption is that the fair scheduler, while recognizing that it needs to 
 kill off some container to make room for the new application, fails to find a 
 SINGLE container satisfying the request for a 4-core container (since all 
 existing containers are 1-core containers), and isn't smart enough to 
 realize it needs to kill off 4 single-core containers (in this case) on a 
 single node for the new application to be able to proceed.
 The exhibited effect is that the new application hangs indefinitely and 
 never gets the resources it requires.
 This can easily be replicated with any YARN application.
 Our go-to scenario in this case is running PySpark with 1-core executors 
 (containers) while trying to launch the H2O.ai framework, which insists on 
 having at least 4 cores per container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698849#comment-14698849
 ] 

Karthik Kambatla commented on YARN-4055:


Thanks for filing and working on this, Inigo. Patch looks mostly good, but for 
one minor comment:
# Looks like NodeManager#createNodeResourceMonitor could just return 
NodeResourceMonitor instead of NodeResourceMonitorImpl

 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698860#comment-14698860
 ] 

Hadoop QA commented on YARN-4055:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 36s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 11s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 48s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 23s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m  9s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   7m  2s | Tests failed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-server-common. |
| {color:green}+1{color} | yarn tests |   6m 12s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  56m 21s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.client.api.impl.TestYarnClient |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12750721/YARN-4055-v1.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / def1293 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8854/console |


This message was automatically generated.

 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch, YARN-4055-v1.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4055) Report node resource utilization in heartbeat

2015-08-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698866#comment-14698866
 ] 

Hudson commented on YARN-4055:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8312 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8312/])
YARN-4055. Report node resource utilization in heartbeat. (Inigo Goiri via 
kasha) (kasha: rev 13604bd5f119fc81b9942190dfa366afad61bc92)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_protos.proto
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/impl/pb/NodeStatusPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/NodeStatus.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java
* hadoop-yarn-project/CHANGES.txt


 Report node resource utilization in heartbeat
 -

 Key: YARN-4055
 URL: https://issues.apache.org/jira/browse/YARN-4055
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Fix For: 2.8.0

 Attachments: YARN-4055-v0.patch, YARN-4055-v1.patch


 Send the resource utilization from the node (obtained in the 
 NodeResourceMonitor) to the RM in the heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v0.patch

First version; the MiniYARNCluster-based unit test is still missing (WIP).

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: YARN-3980-v0.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3980:
---
Assignee: Inigo Goiri  (was: Karthik Kambatla)

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Inigo Goiri
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4057) If ContainersMonitor is not enabled, only print related log info one time

2015-08-16 Thread Jun Gong (JIRA)
Jun Gong created YARN-4057:
--

 Summary: If ContainersMonitor is not enabled, only print related 
log info one time
 Key: YARN-4057
 URL: https://issues.apache.org/jira/browse/YARN-4057
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong
Priority: Minor


ContainersMonitorImpl checks whether it is enabled when handling every event, 
and if it is not enabled it will print the following message again and again:

{quote}
2015-08-17 13:20:13,792 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Neither virutal-memory nor physical-memory is needed. Not running the 
monitor-thread
{quote}
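
One way to do that (a hedged sketch, not the eventual patch): remember in the handler that the message has already been logged, e.g. with a compare-and-set flag.
{code}
// Hypothetical sketch of logging the "monitoring disabled" message only once;
// plain java.util.logging is used here instead of the Hadoop logging wrappers.
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

public class LogOnceSketch {

  private static final Logger LOG = Logger.getLogger(LogOnceSketch.class.getName());

  private final AtomicBoolean warned = new AtomicBoolean(false);
  private final boolean monitoringEnabled = false;

  /** Called for every container event. */
  public void handle(Object event) {
    if (!monitoringEnabled) {
      if (warned.compareAndSet(false, true)) {
        LOG.info("Neither virtual-memory nor physical-memory monitoring is"
            + " enabled. Not running the monitor thread.");
      }
      return;   // skip monitoring work, and stay quiet from now on
    }
    // ... actual monitoring would happen here ...
  }

  public static void main(String[] args) {
    LogOnceSketch monitor = new LogOnceSketch();
    for (int i = 0; i < 3; i++) {
      monitor.handle(new Object());   // the message is printed only on the first call
    }
  }
}
{code}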



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699042#comment-14699042
 ] 

Karthik Kambatla commented on YARN-3980:


Barely skimmed through the patch. In ResourceTrackerService, when creating the 
NodeStatusEvent, should we just include remoteNodeStatus instead of each of its 
members? 
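A hypothetical sketch of that refactoring (simplified stand-ins, not the actual RM classes): pass the whole status object to the event so the call site stops unpacking individual fields.
{code}
// Hypothetical sketch; simplified stand-ins for NodeStatus and NodeStatusEvent.
import java.util.Collections;
import java.util.List;

class NodeStatusSketch {
  List<String> containerStatuses = Collections.emptyList();
  Object nodeUtilization;
  Object containersUtilization;
}

class NodeStatusEventSketch {
  private final NodeStatusSketch status;

  // Before (roughly): NodeStatusEvent(containerStatuses, nodeUtilization, ...)
  // After: one argument carries everything, so adding another field later does
  // not touch every constructor call in ResourceTrackerService.
  NodeStatusEventSketch(NodeStatusSketch status) {
    this.status = status;
  }

  NodeStatusSketch getStatus() {
    return status;
  }
}
{code}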

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Inigo Goiri
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-08-16 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699046#comment-14699046
 ] 

Inigo Goiri commented on YARN-3980:
---

It would change the existing code a lot, but I think it would be cleaner. I can 
put together a proposal along those lines.

 Plumb resource-utilization info in node heartbeat through to the scheduler
 --

 Key: YARN-3980
 URL: https://issues.apache.org/jira/browse/YARN-3980
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Karthik Kambatla
Assignee: Inigo Goiri
 Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch


 YARN-1012 and YARN-3534 collect resource utilization information for all 
 containers and the node respectively and send it to the RM on node heartbeat. 
 We should plumb it through to the scheduler so the scheduler can make use of 
 it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)