[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-09 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Description: 
Trying to run an application on a Nodelabel partition, I found that the 
application execution time is delayed by 5-10 minutes for 500 containers. Of the 
3 machines in total, 2 machines were in the same partition and the app was submitted to it.

After enabling debug logging I was able to find the following:

# From the AM the container ask is for OFF_SWITCH.
# The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
# Since I had about 500 containers, it took about 6 minutes to allocate the 
1st map after the AM allocation.
# Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
next container after the AM allocation.

Only once all 500 container allocations have been done as NODE_LOCAL is the next 
container allocation done as OFF_SWITCH.

{code}
2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
/default-rack, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: *, Relax 
Locality: true, Node Label Expression: 3}

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-143, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-117, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL
{code}
 
{code}
2015-09-09 14:35:45,467 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:45,831 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,469 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,832 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

{code}

{code}
dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
 cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
"root.b.b1" | wc -l

500
{code}
 

(Consumes about 6 minutes)

 

  was:
Trying to run an application on a Nodelabel partition, I found that the 
application execution time is delayed by 5-10 minutes for 500 containers. Of the 
3 machines in total, 2 machines were in the same partition and the app was submitted to it.

After enabling debug logging I was able to find the following:

# From the AM the container ask is for OFF_SWITCH.
# The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
# Since I had about 500 containers, it took about 6 minutes to allocate the 
map after the AM allocation.
# Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
next container after the AM allocation.

Only once all 500 container allocations have been done as NODE_LOCAL is the next 
container allocation done as OFF_SWITCH.

{code}
2015-09-09 15:21:58,954 DEBUG 

[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode

2015-09-11 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740903#comment-14740903
 ] 

Bibin A Chundatt commented on YARN-4126:


Hi [~jianhe]

Thanks for the suggestion

Tested with your suggestion too; it is still failing. 
{{UserGroupConfiguration}} will be loaded with the non-secure configuration. Tried 
setting it in BeforeClass too, but then all the other non-secure tests that try to 
load the keytab file will fail.
Can we refactor the test class by splitting the token-related test cases into a 
different test class? That should be fine, right?

> RM should not issue delegation tokens in unsecure mode
> --
>
> Key: YARN-4126
> URL: https://issues.apache.org/jira/browse/YARN-4126
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, 
> 0003-YARN-4126.patch, 0004-YARN-4126.patch, 0005-YARN-4126.patch
>
>
> ClientRMService#getDelegationToken is currently  returning a delegation token 
> in insecure mode. We should not return the token if it's in insecure mode. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4136) LinuxContainerExecutor loses info when forwarding ResourceHandlerException

2015-09-09 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737989#comment-14737989
 ] 

Bibin A Chundatt commented on YARN-4136:


I feel test cases are not required for this patch.

> LinuxContainerExecutor loses info when forwarding ResourceHandlerException
> --
>
> Key: YARN-4136
> URL: https://issues.apache.org/jira/browse/YARN-4136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Trivial
> Attachments: 0001-YARN-4136.patch
>
>
> The Linux container executor {{launchContainer}} method throws 
> {{ResourceHandlerException}} when there are problems setting up the container, 
> but these aren't propagated in the raised IOE. They should be nested, with 
> the string value included in the message text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-09 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4106:
---
Attachment: 0008-YARN-4106.patch

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> ---
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, 
> 0006-YARN-4106.patch, 0007-YARN-4106.patch, 0008-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure nodelabel in distributed mode
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side.
> *This jira also handles the below issue*
> The timer task for the label update is not getting triggered in the 
> nodemanager for distributed scheduling.
> The task is supposed to trigger every 
> {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode

2015-09-11 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740970#comment-14740970
 ] 

Bibin A Chundatt commented on YARN-4126:


Have tried separating them out; I find this the best way to do it.

> RM should not issue delegation tokens in unsecure mode
> --
>
> Key: YARN-4126
> URL: https://issues.apache.org/jira/browse/YARN-4126
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, 
> 0003-YARN-4126.patch, 0004-YARN-4126.patch, 0005-YARN-4126.patch
>
>
> ClientRMService#getDelegationToken is currently  returning a delegation token 
> in insecure mode. We should not return the token if it's in insecure mode. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4126) RM should not issue delegation tokens in unsecure mode

2015-09-11 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4126:
---
Attachment: 0006-YARN-4126.patch

Hi [~jianhe]

Attaching a patch after refactoring the test cases. Please review.
All test cases are passing fine.

> RM should not issue delegation tokens in unsecure mode
> --
>
> Key: YARN-4126
> URL: https://issues.apache.org/jira/browse/YARN-4126
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, 
> 0003-YARN-4126.patch, 0004-YARN-4126.patch, 0005-YARN-4126.patch, 
> 0006-YARN-4126.patch
>
>
> ClientRMService#getDelegationToken is currently  returning a delegation token 
> in insecure mode. We should not return the token if it's in insecure mode. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-13 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0002-YARN-4140.patch

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5-10 minutes for 500 containers. Of the 
> 3 machines in total, 2 machines were in the same partition and the app was submitted to it.
> After enabling debug logging I was able to find the following:
> # From the AM the container ask is for OFF_SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes to allocate the 
> 1st map after the AM allocation.
> # Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
> next container after the AM allocation.
> Only once all 500 container allocations have been done as NODE_LOCAL is the next 
> container allocation done as OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> {code}
>  
> (Consumes about 6 minutes)
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-14 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743001#comment-14743001
 ] 

Bibin A Chundatt commented on YARN-4140:


Hi [~Naganarasimha]
Thanks for the comment; I had missed it. Have updated the patch as per the comments.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5-10 minutes for 500 containers. Of the 
> 3 machines in total, 2 machines were in the same partition and the app was submitted to it.
> After enabling debug logging I was able to find the following:
> # From the AM the container ask is for OFF_SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes to allocate the 
> 1st map after the AM allocation.
> # Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
> next container after the AM allocation.
> Only once all 500 container allocations have been done as NODE_LOCAL is the next 
> container allocation done as OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> {code}
>  
> (Consumes about 6 minutes)
>  



--
This message was sent by Atlassian 

[jira] [Created] (YARN-4152) NM crash when LogAggregationService#stopContainer called for absent container

2015-09-12 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4152:
--

 Summary: NM crash when LogAggregationService#stopContainer called 
for absent container
 Key: YARN-4152
 URL: https://issues.apache.org/jira/browse/YARN-4152
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


NM crashes during log aggregation.

*Logs*
{code}
2015-09-12 18:44:25,597 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_e51_1442063466801_0001_01_99 is : 143
2015-09-12 18:44:25,670 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Event EventType: KILL_CONTAINER sent to absent container 
container_e51_1442063466801_0001_01_000101
2015-09-12 18:44:25,670 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Removing container_e51_1442063466801_0001_01_000101 from application 
application_1442063466801_0001
2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_STOP for appId application_1442063466801_0001
2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Exiting, bbye..
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
OPERATION=Container Finished - SucceededTARGET=ContainerImpl
RESULT=SUCCESS  APPID=application_1442063466801_0001
CONTAINERID=container_e51_1442063466801_0001_01_000100

{code}

*Analysis*

It looks like {{stopContainer}} is called even for an absent container:
{code}
  case CONTAINER_FINISHED:
LogHandlerContainerFinishedEvent containerFinishEvent =
(LogHandlerContainerFinishedEvent) event;
stopContainer(containerFinishEvent.getContainerId(),
containerFinishEvent.getExitCode());
break;
{code}

We should skip the handling when {{null == context.getContainers().get(containerId)}}.
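
A minimal sketch of such a guard (hedged illustration only, not the actual patch; it assumes the existing {{context}} and {{LOG}} fields of {{LogAggregationService}}):

{code}
// Hedged sketch of the proposed null guard in LogAggregationService#stopContainer.
// Relies on the service's existing "context" and "LOG" fields; not the real patch.
private void stopContainer(ContainerId containerId, int exitCode) {
  Container container = context.getContainers().get(containerId);
  if (container == null) {
    // e.g. KILL_CONTAINER was sent to an absent container: skip instead of NPE.
    LOG.warn("Ignoring CONTAINER_FINISHED for absent container " + containerId);
    return;
  }
  ContainerType containerType =
      container.getContainerTokenIdentifier().getContainerType();
  // ... continue with the existing log-aggregation handling ...
}
{code}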



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4152) NM crash when LogAggregationService#stopContainer called for absent container

2015-09-12 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4152:
---
Description: 
NM crashes during log aggregation.
Ran a Pi job with 500 containers and killed the application in between.

*Logs*
{code}
2015-09-12 18:44:25,597 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_e51_1442063466801_0001_01_99 is : 143
2015-09-12 18:44:25,670 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Event EventType: KILL_CONTAINER sent to absent container 
container_e51_1442063466801_0001_01_000101
2015-09-12 18:44:25,670 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Removing container_e51_1442063466801_0001_01_000101 from application 
application_1442063466801_0001
2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_STOP for appId application_1442063466801_0001
2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Exiting, bbye..
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
OPERATION=Container Finished - SucceededTARGET=ContainerImpl
RESULT=SUCCESS  APPID=application_1442063466801_0001
CONTAINERID=container_e51_1442063466801_0001_01_000100

{code}

*Analysis*

It looks like {{stopContainer}} is called even for an absent container:
{code}
  case CONTAINER_FINISHED:
LogHandlerContainerFinishedEvent containerFinishEvent =
(LogHandlerContainerFinishedEvent) event;
stopContainer(containerFinishEvent.getContainerId(),
containerFinishEvent.getExitCode());
break;
{code}

*Event EventType: KILL_CONTAINER sent to absent container 
container_e51_1442063466801_0001_01_000101*

We should skip the handling when {{null == context.getContainers().get(containerId)}}.

  was:
NM crashes during log aggregation.

*Logs*
{code}
2015-09-12 18:44:25,597 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_e51_1442063466801_0001_01_99 is : 143
2015-09-12 18:44:25,670 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Event EventType: KILL_CONTAINER sent to absent container 
container_e51_1442063466801_0001_01_000101
2015-09-12 18:44:25,670 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Removing container_e51_1442063466801_0001_01_000101 from application 
application_1442063466801_0001
2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_STOP for appId application_1442063466801_0001
2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Exiting, bbye..
2015-09-12 18:44:25,692 INFO 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
OPERATION=Container Finished - SucceededTARGET=ContainerImpl
RESULT=SUCCESS  APPID=application_1442063466801_0001
CONTAINERID=container_e51_1442063466801_0001_01_000100

{code}

*Analysis*

It looks like {{stopContainer}} is called even for an absent container:
{code}
  case 

[jira] [Commented] (YARN-4152) NM crash when LogAggregationService#stopContainer called for absent container

2015-09-12 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742082#comment-14742082
 ] 

Bibin A Chundatt commented on YARN-4152:


The NPE comes from the following line in {{LogAggregationService#stopContainer}}:
{code}
ContainerType containerType = context.getContainers().get(
containerId).getContainerTokenIdentifier().getContainerType();
{code}

> NM crash when LogAggregationService#stopContainer called for absent container
> -
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>
> NM crashes during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> It looks like {{stopContainer}} is called even for an absent container:
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> We should skip the handling when {{null == context.getContainers().get(containerId)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-26 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4155:
---
Attachment: 0003-YARN-4155.patch

Hi [~ste...@apache.org]

Looked into the issue again; it seems to be a timing issue.

{{TestLogAggregationService#numOfLogsAvailable}} is checked immediately after the 
{{aggregator.doLogAggregationOutOfBand()}} call.
While log aggregation is still in progress the file name carries the *.tmp* extension, 
so the following check in {{TestLogAggregationService#numOfLogsAvailable}}
{code}
if (filename.contains(LogAggregationUtils.TMP_FILE_SUFFIX)
  || (lastLogFile != null && filename.contains(lastLogFile)
  && sizeLimited)) {
LOG.info("fileName :" + filename);
LOG.info("lastLogFile :" + lastLogFile);
return -1;
  }
{code}
returns -1.
Attaching a patch based on the same (see the sketch below).
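
A minimal sketch of the direction such a timing fix could take (hedged; the {{numOfLogsAvailable}} parameters and the surrounding test variables are assumed, and this is not the actual patch): poll until the in-progress *.tmp* file has been renamed instead of asserting immediately.

{code}
// Hedged sketch, not the actual patch: wait for the aggregated log file to lose
// its ".tmp" suffix before asserting, rather than checking immediately.
aggregator.doLogAggregationOutOfBand();

long deadline = System.currentTimeMillis() + 15000;   // give up after 15 seconds
int numOfLogs = -1;
while (numOfLogs < 0 && System.currentTimeMillis() < deadline) {
  // numOfLogsAvailable(...) returns -1 while a *.tmp file is still being written;
  // the parameter names here are assumed from the test class.
  numOfLogs = numOfLogsAvailable(logAggregationService, application, true, null);
  if (numOfLogs < 0) {
    Thread.sleep(100);
  }
}
Assert.assertEquals(expectedNumberOfLogFiles, numOfLogs);
{code}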

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933636#comment-14933636
 ] 

Bibin A Chundatt commented on YARN-4140:


The checkstyle issues are not from the modified lines of code, and the test case 
failures are unrelated.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5-10 minutes for 500 containers. Of the 
> 3 machines in total, 2 machines were in the same partition and the app was submitted to it.
> After enabling debug logging I was able to find the following:
> # From the AM the container ask is for OFF_SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes to allocate the 
> 1st map after the AM allocation.
> # Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
> next container after the AM allocation.
> Only once all 500 container allocations have been done as NODE_LOCAL is the next 
> container allocation done as OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> 

[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-30 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0013-YARN-4140.patch

Hi [~leftnoteasy]

Thanks for the review and comments.
Have updated the patch based on the comments.

# Comment 1: Got confused with the resource request asks. Only the label of the last 
ANY request needs to be stored for decreasing usages (see the sketch after this list) in
{code}
 if (updatePendingResources) {
   ...
}
{code}
# Comment 2: Updated the variable names as per the comments.
# Comment 3: I wanted to change the request only when a change happens; that is the 
reason I added it.
# Comment 4: Faced Fair and Fifo related test case failures when the request 
label was null. Currently that is not happening.
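
A minimal, hedged illustration of the idea in Comment 1 (hypothetical field and method names, not the actual YARN-4140 patch): remember the node-label expression of the last ANY resource request per priority and charge later pending-resource decrements against that partition.

{code}
// Hedged sketch only -- hypothetical names, not the real patch.
// Record the label of the last ResourceRequest.ANY ask per priority so that
// pending resources are later decremented against the correct partition.
private final Map<Priority, String> lastAnyRequestLabel = new HashMap<>();

void recordAnyRequestLabel(ResourceRequest ask) {
  if (ResourceRequest.ANY.equals(ask.getResourceName())) {
    lastAnyRequestLabel.put(ask.getPriority(), ask.getNodeLabelExpression());
  }
}

void decPendingForPriority(Priority priority, Resource resource) {
  String label = lastAnyRequestLabel.get(priority);
  if (label == null) {
    label = RMNodeLabelsManager.NO_LABEL;
  }
  // queueUsage is a hypothetical ResourceUsage-style accumulator.
  queueUsage.decPending(label, resource);
}
{code}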



> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5-10 minutes for 500 containers. Of the 
> 3 machines in total, 2 machines were in the same partition and the app was submitted to it.
> After enabling debug logging I was able to find the following:
> # From the AM the container ask is for OFF_SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes to allocate the 
> 1st map after the AM allocation.
> # Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
> next container after the AM allocation.
> Only once all 500 container allocations have been done as NODE_LOCAL is the next 
> container allocation done as OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 

[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-28 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933471#comment-14933471
 ] 

Bibin A Chundatt commented on YARN-4155:


Please review the attached patch.

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-28 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0012-YARN-4140.patch

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5-10 minutes for 500 containers. Of the 
> 3 machines in total, 2 machines were in the same partition and the app was submitted to it.
> After enabling debug logging I was able to find the following:
> # From the AM the container ask is for OFF_SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes to allocate the 
> 1st map after the AM allocation.
> # Tested with about 1K maps using the Pi job; it took 17 minutes to allocate the 
> next container after the AM allocation.
> Only once all 500 container allocations have been done as NODE_LOCAL is the next 
> container allocation done as OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | 

[jira] [Created] (YARN-4216) Containers logs not shown for containers assigned on NM after recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4216:
--

 Summary: Containers logs not shown for containers assigned on NM 
after recovery
 Key: YARN-4216
 URL: https://issues.apache.org/jira/browse/YARN-4216
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


Steps to reproduce

# Start 2 nodemanagers with NM recovery enabled
# Submit a pi job with 20 maps
# Once 5 maps are completed on NM 1, stop the NM ({{yarn --daemon stop nodemanager}})
(The logs of all completed containers get aggregated to HDFS)
# Now start NM 1 again and wait for job completion
*The logs of the newly assigned containers on NM 1 are not shown*

*HDFS log dir state*

# When logs are aggregated to HDFS during stop, the file is uploaded with the name (localhost_38153)
# On log aggregation after restarting the NM, the newly assigned container logs get 
uploaded with the name (localhost_38153.tmp)

In the History Server the logs are not shown for the new task attempts





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Summary: Container logs not shown for newly assigned containers  after NM  
recovery  (was: Containers logs not shown for containers assigned on NM after 
recovery)

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Containers logs not shown for containers assigned on NM after recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Attachment: NMLog

> Containers logs not shown for containers assigned on NM after recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Containers logs not shown for containers assigned on NM after recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Component/s: nodemanager
 log-aggregation

> Containers logs not shown for containers assigned on NM after recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Containers logs not shown for containers assigned on NM after recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Attachment: yarn-site.xml

> Containers logs not shown for containers assigned on NM after recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Containers logs not shown for containers assigned on NM after recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Attachment: ScreenshotFolder.png

> Containers logs not shown for containers assigned on NM after recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: ScreenshotFolder.png
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939924#comment-14939924
 ] 

Bibin A Chundatt commented on YARN-4216:


So {{yarn --daemon stop nodemanager}} should be considered a recoverable case, 
right? And the logs should be uploaded on stop only when the NM is being 
decommissioned?

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps 
> # Once 5 maps are completed on NM1, stop the NM (yarn --daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, the node file is named (localhost_38153)
> # On log aggregation after restarting the NM, the newly assigned container logs get 
> uploaded under the name (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-01 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939830#comment-14939830
 ] 

Bibin A Chundatt commented on YARN-4216:


[~jlowe]
The problem doesn't happen when the NM is killed with kill -9 before restarting.

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps 
> # Once 5 maps are completed on NM1, stop the NM (yarn --daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, the node file is named (localhost_38153)
> # On log aggregation after restarting the NM, the newly assigned container logs get 
> uploaded under the name (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-10-02 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0014-YARN-4140.patch

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch, 0014-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. In 
> total there were 3 machines; 2 were in the same partition, and the app was 
> submitted to that partition.
> After enabling debug logging I was able to find the below
> # From the AM the container ask is for OFF_SWITCH
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes to 
> allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the 
> next container after the AM allocation
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done as OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat 

[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-10-02 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941450#comment-14941450
 ] 

Bibin A Chundatt commented on YARN-4140:


Uploaded a new patch for the same.
# {{String lastAnyRequestLabel = null;}} is not necessary. Added a check so that the 
label is not updated for ANY, since that is already handled by the existing code and 
the lastRequest label will be correct for the queue and app usage (a rough sketch of 
the idea follows below).
# Only formatting is changed now.
# Handled the FIFO and Fair scheduler test case corrections.
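
For readers following the discussion, here is a rough, hypothetical illustration of the accounting idea in point 1 above; it is not the attached patch. {{ResourceRequest}}, {{ResourceRequest.ANY}} and its getters are the real YARN API, while {{LabelAccountingSketch}} and {{updateLabelForAccounting}} are invented names.

{code}
// Hypothetical illustration only -- not the attached patch. The idea: take the
// node-label expression used for queue/app usage accounting from the ANY
// (off-switch) request alone, since rack- and node-local requests may carry an
// empty label expression.
import org.apache.hadoop.yarn.api.records.ResourceRequest;

final class LabelAccountingSketch {
  private String lastRequestLabel;  // label used for queue and appUsage accounting

  void updateLabelForAccounting(ResourceRequest request) {
    // Only the ANY request decides which partition the application asked for.
    if (ResourceRequest.ANY.equals(request.getResourceName())) {
      lastRequestLabel = request.getNodeLabelExpression();
    }
    // Rack- and node-local requests are ignored here; they follow the ANY label.
  }

  String labelForUsage() {
    return lastRequestLabel;
  }
}
{code}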

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch, 0014-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. In 
> total there were 3 machines; 2 were in the same partition, and the app was 
> submitted to that partition.
> After enabling debug logging I was able to find the below
> # From the AM the container ask is for OFF_SWITCH
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes to 
> allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the 
> next container after the AM allocation
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done as OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to 

[jira] [Updated] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-26 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4155:
---
Attachment: (was: 0003-YARN-4155.patch)

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-26 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4155:
---
Attachment: 0003-YARN-4155.patch

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4140) RM container allocation delated incase of app submitted to Nodel partition

2015-09-09 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4140:
--

 Summary: RM container allocation delated incase of app submitted 
to Nodel partition
 Key: YARN-4140
 URL: https://issues.apache.org/jira/browse/YARN-4140
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Trying to run application on Nodelabel partition I  found that the application 
execution time is delayed by 5 – 10 min for 500 containers . Total 3 machines 2 
machines were in same partition and app submitted to same.

After enabling debug was able to find the below

# From AM the container ask is for OFF-SWITCH
# RM allocating all containers to NODE_LOCAL as shown in logs below.
# So since I was having about 500 containers time taken was about – 6 minutes 
to allocated map after AM allocation.
#Tested with about 1K maps with PI job took 17 minutes to allocated the next 
container after AM allocation

Once 500 container allocation on NODE_LOCAL is done the next container 
allocation is done on OFF_SWITCH

{code}
2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
/default-rack, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: *, Relax 
Locality: true, Node Label Expression: 3}

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-143, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-117, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL
{code}
 
{code}
2015-09-09 14:35:45,467 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:45,831 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,469 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,832 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

{code}

{code}
dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
 cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
"root.b.b1" | wc -l

500
{code}
 

(Consumes about 6 minutes)
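
A back-of-the-envelope check of the numbers above (an assumption, not a measurement): if each of the 500 outstanding requests is first consumed by a NODE_LOCAL assignment, one per scheduling opportunity, the observed 6 minutes works out to roughly 0.7 s per assignment.

{code}
// Rough arithmetic only; the 500 and the 6 minutes are the values reported above.
public class DelayEstimate {
  public static void main(String[] args) {
    int nodeLocalAssignments = 500;     // from: grep "NODE_LOCAL" ... | wc -l
    double observedSeconds = 6 * 60.0;  // "(Consumes about 6 minutes)"
    System.out.printf("~%.2f s per NODE_LOCAL assignment%n",
        observedSeconds / nodeLocalAssignments);  // ~0.72 s
  }
}
{code}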

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-08 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736158#comment-14736158
 ] 

Bibin A Chundatt commented on YARN-4106:


The Findbugs report is not showing anything; it looks like a build report problem.

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> ---
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch, 0004-YARN-4106.patch, 0005-YARN-4106.patch, 
> 0006-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure nodelabel in distributed mode
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side 
> *This jira also handles the below issue too*
> The timer task for label update is not getting triggered in the Nodemanager 
> in distributed mode
> Task is supposed to trigger every 
> {{yarn.nodemanager.node-labels.provider.fetch-interval-ms}}
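
For reference, a minimal sketch of the distributed node-label settings listed in the steps above, set programmatically for brevity; on a real cluster they belong in yarn-site.xml. The provider property name and the interval value are assumptions here, since the issue text only says "provider = config" and quotes the interval as "12ms".

{code}
// Sketch only: property names/values below are assumptions based on the steps above.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DistributedNodeLabelConf {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.set("yarn.node-labels.configuration-type", "distributed");
    // Assumed full name of the "provider = config" setting mentioned above.
    conf.set("yarn.nodemanager.node-labels.provider", "config");
    // Illustrative interval; the issue text quotes "12ms", which looks truncated.
    conf.set("yarn.nodemanager.node-labels.provider.fetch-interval-ms", "120000");
    System.out.println(conf.get("yarn.node-labels.configuration-type"));
  }
}
{code}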



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4136) LinuxContainerExecutor loses info when forwarding ResourceHandlerException

2015-09-09 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-4136:
--

Assignee: Bibin A Chundatt

> LinuxContainerExecutor loses info when forwarding ResourceHandlerException
> --
>
> Key: YARN-4136
> URL: https://issues.apache.org/jira/browse/YARN-4136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Trivial
> Attachments: 0001-YARN-4136.patch
>
>
> The Linux container executor {{launchContainer}} method throws 
> {{ResourceHandlerException}} when there are problems setting up the container 
> -but these aren't propagated in the raised IOE. They should be nested with 
> the string value included in the message text.
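
A minimal, self-contained sketch of the kind of change being asked for: nest the original exception so its message and stack trace survive in the raised IOE. {{ResourceHandlerException}} below is a stand-in class for illustration, not the YARN one.

{code}
import java.io.IOException;

public class NestedExceptionSketch {
  // Stand-in for the real ResourceHandlerException, just to keep the example runnable.
  static class ResourceHandlerException extends Exception {
    ResourceHandlerException(String msg) { super(msg); }
  }

  static void setupContainerResources() throws ResourceHandlerException {
    throw new ResourceHandlerException("cgroups controller not mounted");
  }

  public static void main(String[] args) throws IOException {
    try {
      setupContainerResources();
    } catch (ResourceHandlerException e) {
      // Include the original message and chain the cause instead of dropping it.
      throw new IOException("Failed to set up container: " + e, e);
    }
  }
}
{code}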



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodel partition

2015-09-09 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Description: 
Trying to run application on Nodelabel partition I  found that the application 
execution time is delayed by 5 – 10 min for 500 containers . Total 3 machines 2 
machines were in same partition and app submitted to same.

After enabling debug was able to find the below

# From AM the container ask is for OFF-SWITCH
# RM allocating all containers to NODE_LOCAL as shown in logs below.
# So since I was having about 500 containers time taken was about – 6 minutes 
to allocated map after AM allocation.
# Tested with about 1K maps using PI job took 17 minutes to allocate  next 
container after AM allocation

Once 500 container allocation on NODE_LOCAL is done the next container 
allocation is done on OFF_SWITCH

{code}
2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
/default-rack, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: *, Relax 
Locality: true, Node Label Expression: 3}

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-143, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1441791998224_0001 request={Priority: 
20, Capability: , # Containers: 500, Location: 
host-10-19-92-117, Relax Locality: true, Node Label Expression: }

2015-09-09 15:21:58,954 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL
{code}
 
{code}
2015-09-09 14:35:45,467 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:45,831 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,469 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

2015-09-09 14:35:46,832 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=1, numContainers=1 --> , NODE_LOCAL

{code}

{code}
dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
 cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
"root.b.b1" | wc -l

500
{code}
 

(Consumes about 6 minutes)

 

  was:
Trying to run application on Nodelabel partition I  found that the application 
execution time is delayed by 5 – 10 min for 500 containers . Total 3 machines 2 
machines were in same partition and app submitted to same.

After enabling debug was able to find the below

# From AM the container ask is for OFF-SWITCH
# RM allocating all containers to NODE_LOCAL as shown in logs below.
# So since I was having about 500 containers time taken was about – 6 minutes 
to allocated map after AM allocation.
#Tested with about 1K maps with PI job took 17 minutes to allocated the next 
container after AM allocation

Once 500 container allocation on NODE_LOCAL is done the next container 
allocation is done on OFF_SWITCH

{code}
2015-09-09 15:21:58,954 DEBUG 

[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodel partition

2015-09-09 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Summary: RM container allocation delayed incase of app submitted to Nodel 
partition  (was: RM container allocation delated incase of app submitted to 
Nodel partition)

> RM container allocation delayed incase of app submitted to Nodel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocated map after AM allocation.
> #Tested with about 1K maps with PI job took 17 minutes to allocated the next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> {code}
>  
> (Consumes about 6 minutes)
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943113#comment-14943113
 ] 

Bibin A Chundatt commented on YARN-4216:


{quote}
That's why YARN-1362 was done, so we can explicitly tell the nodemanager 
whether or not the NM is under supervision and likely to restart.
{quote}
*yarn.nodemanager.recovery.supervised=false* in my current setup.
In this case, as I understand from the above comment, I am supposed to set 
*yarn.nodemanager.recovery.supervised* to true to inform the NM that the restart is 
under supervision (a minimal sketch of the relevant settings is below).
 
[~jlowe] So should I close this jira?
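
A minimal sketch of the settings being discussed, shown programmatically for brevity; on a real NM these go into yarn-site.xml before the daemon is restarted. The property names are the ones quoted in this thread plus the standard NM recovery switch; treat the snippet as illustrative rather than authoritative.

{code}
// Sketch: enable NM work-preserving restart and mark restarts as supervised.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SupervisedRecoveryConf {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
    // Tell the NM that a stop is expected to be followed by a restart, so
    // container logs are not finalized and uploaded on stop.
    conf.setBoolean("yarn.nodemanager.recovery.supervised", true);
    System.out.println(conf.getBoolean("yarn.nodemanager.recovery.supervised", false));
  }
}
{code}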

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps 
> # Once 5 maps are completed on NM1, stop the NM (yarn --daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, the node file is named (localhost_38153)
> # On log aggregation after restarting the NM, the newly assigned container logs get 
> uploaded under the name (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943124#comment-14943124
 ] 

Bibin A Chundatt commented on YARN-4216:


[Document|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html]
 doesn't mention *yarn.nodemanager.recovery.supervised*. Should I 
update the doc?

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps 
> # Once 5 maps are completed on NM1, stop the NM (yarn --daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, the node file is named (localhost_38153)
> # On log aggregation after restarting the NM, the newly assigned container logs get 
> uploaded under the name (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877387#comment-14877387
 ] 

Bibin A Chundatt commented on YARN-4155:


[~ste...@apache.org]

Also tried it locally in my setup; it's passing:
{noformat}
---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
Tests run: 34, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 112.588 sec - 
in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

Results :

Tests run: 34, Failures: 0, Errors: 0, Skipped: 0

{noformat}

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-18 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0006-YARN-4140.patch

Attaching the updated patch.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. In 
> total there were 3 machines; 2 were in the same partition, and the app was 
> submitted to that partition.
> After enabling debug logging I was able to find the below
> # From the AM the container ask is for OFF_SWITCH
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes to 
> allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the 
> next container after the AM allocation
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done as OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> {code}
>  
> (Consumes about 6 minutes)
>  



--
This 

[jira] [Commented] (YARN-4182) Killed containers disappear on app attempt page

2015-09-18 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805183#comment-14805183
 ] 

Bibin A Chundatt commented on YARN-4182:


It's similar to YARN-3603, right?

> Killed containers disappear on app attempt page
> ---
>
> Key: YARN-4182
> URL: https://issues.apache.org/jira/browse/YARN-4182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Jeff Zhang
> Attachments: 2015-09-18_1601.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0003-YARN-4176.patch

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.
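
A rough, hypothetical illustration of the resync idea proposed above, not the attached patch: resend the current labels to the RM on a fixed interval, regardless of whether the previous registration or heartbeat carried them successfully. {{sendLabelsToRM}} and {{currentLabels}} are invented placeholders.

{code}
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

abstract class NodeLabelResyncSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // resyncIntervalMs would come from
  // yarn.nodemanager.node-labels.provider.resync-interval-ms
  void start(long resyncIntervalMs) {
    scheduler.scheduleWithFixedDelay(
        () -> sendLabelsToRM(currentLabels()),
        resyncIntervalMs, resyncIntervalMs, TimeUnit.MILLISECONDS);
  }

  abstract Set<String> currentLabels();

  abstract void sendLabelsToRM(Set<String> labels);
}
{code}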



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0002-YARN-4176.patch

Attaching the patch after test case correction.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14899968#comment-14899968
 ] 

Bibin A Chundatt commented on YARN-4176:


To my understanding, the checkstyle issues were not introduced by this patch.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-23 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904358#comment-14904358
 ] 

Bibin A Chundatt commented on YARN-4176:


The {{hadoop.yarn.logaggregation.TestAggregatedLogsBlock}} test failure is not 
related to the attached patch. The checkstyle warning is because the file is more 
than 2K lines.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3433) Jersey tests failing with Port in Use -again

2015-09-23 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904354#comment-14904354
 ] 

Bibin A Chundatt commented on YARN-3433:


Hi [~ste...@apache.org]

The same test failed with port in use again during one of the CI builds:
https://builds.apache.org/job/PreCommit-YARN-Build/9232/testReport

> Jersey tests failing with Port in Use -again
> 
>
> Key: YARN-3433
> URL: https://issues.apache.org/jira/browse/YARN-3433
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-3433.patch
>
>
> ASF Jenkins jersey tests failing with port in use exceptions.
> The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't 
> scan for a spare port —so is too brittle on a busy server
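
For what it's worth, a common way to avoid a hard-coded port like 9998 is to let the OS hand out a free ephemeral port. A minimal sketch with plain java.net follows; this is a generic technique, not the change made in YARN-2912.

{code}
import java.io.IOException;
import java.net.ServerSocket;

public final class FreePortFinder {
  private FreePortFinder() {}

  // Binding to port 0 asks the OS for any free ephemeral port.
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Jersey test could bind to port " + findFreePort());
  }
}
{code}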



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-23 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0005-YARN-4176.patch

Thanks [~Naganarasimha] for comments

{quote}
 increase this sync duration to a greater value like 10 mins.
{quote}
The update interval is 10 mins, so the resync interval should be less than 10.

All other comments have been handled.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-22 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0009-YARN-4140.patch

Hi [~sunilg]

Thanks for the review and comments.
Have updated the test cases and patch as per the comments.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. In 
> total there were 3 machines; 2 were in the same partition, and the app was 
> submitted to that partition.
> After enabling debug logging I was able to find the below
> # From the AM the container ask is for OFF_SWITCH
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes to 
> allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the 
> next container after the AM allocation
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done as OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat 

[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-22 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903133#comment-14903133
 ] 

Bibin A Chundatt commented on YARN-4176:


Hi [~leftnoteasy]

Could you please look into this issue?

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # Distributed node labels: after the NM has registered with the RM, if cluster node 
> labels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is another change in the labels
> # If NM registration with node labels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing node labels with the RM every x interval
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> and resend node labels to the RM based on this config, no matter whether the 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0011-YARN-4140.patch

Hi [~sunilg]

Updating the params as per the comment.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. In 
> total there were 3 machines; 2 were in the same partition, and the app was 
> submitted to that partition.
> After enabling debug logging I was able to find the below
> # From the AM the container ask is for OFF_SWITCH
> # The RM is allocating all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, the time taken was about 6 minutes to 
> allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the 
> next container after the AM allocation
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done as OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat 

[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-25 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908798#comment-14908798
 ] 

Bibin A Chundatt commented on YARN-4140:


Failures are not related to this patch

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat 

[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900485#comment-14900485
 ] 

Bibin A Chundatt commented on YARN-4152:


Looks like the issue exists only for {{LogAggregationService}}; in 
{{ContainerEventDispatcher}} it is already handled.

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> NM crash during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> Looks like {{stopContainer}} is called even for an absent container 
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> Should skip when {{null==context.getContainers().get(containerId)}} 
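A minimal sketch of the guard suggested above, inside {{LogAggregationService#stopContainer}}; the exact warning text is illustrative, not the committed fix:

{code}
// At the top of LogAggregationService#stopContainer(containerId, exitCode):
// if the container was already removed from the NM context (e.g. the app
// was killed and KILL_CONTAINER went to an absent container), skip it
// instead of dereferencing a null Container and crashing the dispatcher.
if (context.getContainers().get(containerId) == null) {
  LOG.warn("Log aggregation cannot be started for " + containerId
      + ", as its container is absent in the NM context");
  return;
}
{code}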



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900524#comment-14900524
 ] 

Bibin A Chundatt commented on YARN-4140:


They are related; will check how to update the testcases for that.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500

[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900705#comment-14900705
 ] 

Bibin A Chundatt commented on YARN-4176:


The checkstyle warning is due to the file length
{noformat}
File length is 2,146 lines (max allowed is 2,000).
{noformat}
I feel it can be skipped, since the file was already over 2K lines before this patch

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # For distributed nodelabels, after the NM has registered with the RM, if cluster 
> nodelabels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is a change in its labels
> # When NM registration with nodelabels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing nodeLabels with the RM every x interval
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; the NM 
> will resend nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900498#comment-14900498
 ] 

Bibin A Chundatt commented on YARN-4140:


Will recheck {{TestNodeLabelContainerAllocation}} failures

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat hadoop-dsperf-resourcemanager-host-127.log | grep "NODE_LOCAL" | grep 
> "root.b.b1" | wc -l
> 500
> 

[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-22 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0008-YARN-4140.patch

Hi [~leftnoteasy]

Could you please review the attached patch?
When the label is updated for the *any* request, the pending resource usage for the queue and the app 
also needs to be updated, right? Have changed the patch based on that.
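
Roughly, the accounting change described above amounts to something like the following sketch; {{queueUsage}} and {{appUsage}} stand in for the scheduler's per-partition pending-resource trackers, and the variable names are illustrative rather than taken from the patch:

{code}
// When the node label expression of a pending ANY ResourceRequest changes,
// move its pending resources from the old partition to the new one, for
// both the queue-level and the application-level usage.
Resource pending = Resources.multiply(
    request.getCapability(), request.getNumContainers());

queueUsage.decPending(oldLabelExpression, pending);
appUsage.decPending(oldLabelExpression, pending);

queueUsage.incPending(newLabelExpression, pending);
appUsage.incPending(newLabelExpression, pending);
{code}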

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> 

[jira] [Updated] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4167:
---
Attachment: 0002-YARN-4167.patch

Renamed patch and uploaded the same again to trigger Jenkins


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> with a value that mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup an NPE is thrown in {{RMActiveServices#serviceStop}}
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193
> {noformat}
> *Impact Area*: RM failover with wrong configuration
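A sketch of the null guard implied by the stack trace above; the try/catch mirrors the existing "Error closing store." logging, and the rest is illustrative rather than the committed patch:

{code}
// In RMActiveServices#serviceStop(): serviceInit() can fail before the
// RMStateStore is created, yet stop() is still invoked on the half-built
// service, so the store reference must be null-checked before close().
RMStateStore store = rmContext.getStateStore();
if (store != null) {
  try {
    store.close();
  } catch (Exception e) {
    LOG.error("Error closing store.", e);
  }
}
{code}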



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0007-YARN-4140.patch

Hi [~leftnoteasy]

Details of the testcase failures:
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
hadoop.yarn.server.resourcemanager.TestClientRMService
{noformat}
These two are not related to the uploaded patch.
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
{noformat}
Fixed.

Jenkins was not triggered due to a {{git pull failure}}. Uploading a new patch to 
trigger Jenkins again.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> 

[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877032#comment-14877032
 ] 

Bibin A Chundatt commented on YARN-4140:


Failures are not due to the current patch.
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
{noformat}
Reason: class not found. Also, the {{TestNodeLabelContainerAllocation}} testcases 
fail on trunk even without the patch.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, 

[jira] [Updated] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4152:
---
Attachment: 0003-YARN-4152.patch

Hi [~sunilg]

Thanks for the comments. Updated the patch as per the comments.
The operation performed was an application kill, and the container entry is not 
available in the context.

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> NM crash during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> Looks like {{stopContainer}} is called even for an absent container 
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> Should skip when {{null==context.getContainers().get(containerId)}} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0001-YARN-4176.patch

Attaching patch for review.

[~Naganarasimha]
# Have taken care of the configuration naming.
# Haven't changed the {{System.currentTimeMillis}} usage as of now.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # For distributed nodelabels, after the NM has registered with the RM, if cluster 
> nodelabels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is a change in its labels
> # When NM registration with nodelabels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing nodeLabels with the RM every x interval
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; the NM 
> will resend nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877142#comment-14877142
 ] 

Bibin A Chundatt commented on YARN-4155:


[~ste...@apache.org]
Seems this time 
{{org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLocalFileDeletionOnDiskFull}}
 is failing. Checked {{testLogAggregationServiceWithInterval}} in the patch. 
Should I handle {{testLocalFileDeletionOnDiskFull}} in this JIRA?

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877157#comment-14877157
 ] 

Bibin A Chundatt commented on YARN-4155:


Also tried running it locally through Eclipse; it passes in my setup.


> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0004-YARN-4176.patch

Hi [~Naganarasimha]
Thanks for the review comments.

Attaching the patch after handling them.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # For distributed nodelabels, after the NM has registered with the RM, if cluster 
> nodelabels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is a change in its labels
> # When NM registration with nodelabels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing nodeLabels with the RM every x interval
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; the NM 
> will resend nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-23 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905818#comment-14905818
 ] 

Bibin A Chundatt commented on YARN-4140:


Synced offline with [~Naganarasimha Garla]

{quote}
If so is it req to handle incPendingResourcesForLabel and 
decPendingResourceForLabel for such requests currently we are doing for only 
previousAnyRequest only .
{quote}
Currently the increment and decrement happen only for the *Any* request, so 
updates are not required (see the sketch at the end of this comment).
{quote}
I think the modifications for updates would not be required when 
recoverPreemptedRequest is true right ?
{quote}
Not a mandatory change to make.
{quote}
currently a member with the same name requests
{quote}
As part of YARN-4157 the same is being changed.
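
For reference, the "only for *Any*" behaviour mentioned above boils down to a guard of roughly this shape; this is a sketch, not the actual scheduler code, and {{updatePendingResources}} is a hypothetical helper:

{code}
// Pending-resource bookkeeping is keyed off the ANY ("*") request only;
// node-local and rack-local requests do not touch the pending counters.
if (!ResourceRequest.ANY.equals(request.getResourceName())) {
  return; // nothing to increment/decrement for non-ANY requests
}
updatePendingResources(request); // hypothetical helper for the ANY case
{code}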

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 --> 

[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-24 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907331#comment-14907331
 ] 

Bibin A Chundatt commented on YARN-4176:


Hi [~leftnoteasy]

In YARN-4106 the Timer-related fix for reloading the conf every interval is done,
so this JIRA can't completely replace it.

Only the heartbeat resend is common between the two JIRAs. In YARN-4106 we resend 
labels only when the previous update failed; in this JIRA we send the labels along 
with the heartbeat whether the previous attempt failed or succeeded.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch, 0002-YARN-4176.patch, 
> 0003-YARN-4176.patch, 0004-YARN-4176.patch, 0005-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues
> # For distributed nodelabels, after the NM has registered with the RM, if cluster 
> nodelabels are removed and added, the NM doesn't resend its labels in the heartbeat 
> again until there is a change in its labels
> # When NM registration with nodelabels fails, the NM should resend the labels to the RM 
> The above cases can be handled by resyncing nodeLabels with the RM every x interval
> # Add property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}}; the NM 
> will resend nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-24 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0010-YARN-4140.patch

Uploading the patch after rebasing on the latest trunk

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> dsperf@host-127:/opt/bibin/dsperf/HAINSTALL/install/hadoop/resourcemanager/logs1>
>  cat 

[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-24 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907427#comment-14907427
 ] 

Bibin A Chundatt commented on YARN-4155:


Sorry, multiple comments by mistake

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-24 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907375#comment-14907375
 ] 

Bibin A Chundatt commented on YARN-4155:


Hi [~ste...@apache.org]

Issue YARN-4168 tracks the above testcase failure.
Hope all other required changes are covered as part of the patch

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943574#comment-14943574
 ] 

Bibin A Chundatt commented on YARN-4216:


When yarn.nodemanager.recovery.supervised=true and the nodemanager is stopped, abort 
aggregation is called:

2015-10-05 20:17:20,634 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 *Aborting log aggregation for application_1444056058955_0002*
{noformat}
2015-10-05 20:17:20,634 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
2015-10-05 20:17:20,634 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
 waiting for pending aggregation during exit
2015-10-05 20:17:20,634 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 Aborting log aggregation for application_1444056058955_0002
2015-10-05 20:17:20,634 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 Aggregation did not complete for application application_1444056058955_0002
2015-10-05 20:17:20,639 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 is interrupted. Exiting.
2015-10-05 20:17:20,664 INFO org.apache.hadoop.ipc.Server: Stopping server on 
8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
listener on 8040
2015-10-05 20:17:20,665 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
2015-10-05 20:17:20,665 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Public cache exiting
2015-10-05 20:17:20,665 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is 
interrupted. Exiting.
2015-10-05 20:17:20,671 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping NodeManager metrics system...
2015-10-05 20:17:20,674 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
NodeManager metrics system stopped.
{noformat}

On stop, the container logs are neither cleaned up nor uploaded to HDFS.

But a decommission + NM restart while the application is running should cause the 
same missing-log scenario, as per {{LogAggregationService#stopAggregators}}:

{code}
boolean supervised = getConfig().getBoolean(
    YarnConfiguration.NM_RECOVERY_SUPERVISED,
    YarnConfiguration.DEFAULT_NM_RECOVERY_SUPERVISED);
// if recovery on restart is supported then leave outstanding aggregations
// to the next restart
boolean shouldAbort = context.getNMStateStore().canRecover()
    && !context.getDecommissioned() && supervised;
// politely ask to finish
for (AppLogAggregator aggregator : appLogAggregators.values()) {
  if (shouldAbort) {
    aggregator.abortLogAggregation();
  } else {
    aggregator.finishLogAggregation();
  }
}
{code}
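To make the condition above easier to read, here is an illustrative-only restatement of how {{shouldAbort}} evaluates in the scenarios discussed; the standalone class below is a sketch, not the actual LogAggregationService code:

{code}
public class ShouldAbortSketch {
  // Illustrative restatement of the shouldAbort decision quoted above.
  static boolean shouldAbort(boolean canRecover, boolean decommissioned,
      boolean supervised) {
    // Abort (leave aggregation for the next NM restart) only when the state
    // store can recover, the node is NOT decommissioned, and
    // yarn.nodemanager.recovery.supervised=true.
    return canRecover && !decommissioned && supervised;
  }

  public static void main(String[] args) {
    System.out.println(shouldAbort(true, false, true));  // true : supervised stop, upload deferred
    System.out.println(shouldAbort(true, true, true));   // false: decommission, aggregation finished
    System.out.println(shouldAbort(true, false, false)); // false: unsupervised stop, aggregation finished
  }
}
{code}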

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit a pi job with 20 maps 
> # Once 5 maps are completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop, it is with NAME (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs get 
> uploaded with the name (localhost_38153.tmp) 
> In the History Server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4454) NM to nodelabel mapping going wrong after RM restart

2015-12-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061476#comment-15061476
 ] 

Bibin A Chundatt commented on YARN-4454:


On the second recovery the ordering goes wrong:

{noformat}
2015-12-14 17:17:54,906 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager:   
NM=host-10-19-92-188:64318, labels=[ResourcePool_1]
2015-12-14 17:17:54,906 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager:   
NM=host-10-19-92-188:0, labels=[ResourcePool_null]{noformat}
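A minimal sketch, assuming the mapping really is kept once under the bare hostname (port 0) and once under the concrete NodeId, as the two log lines above suggest; only {{NodeId}} is the real YARN type, the map itself is illustrative:

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public class NodeLabelReplaySketch {
  public static void main(String[] args) {
    Map<NodeId, Set<String>> labels = new HashMap<NodeId, Set<String>>();
    // Hostname-only entry (port 0) and concrete NM entry can carry different labels.
    labels.put(NodeId.newInstance("host-10-19-92-188", 0),
        Collections.singleton("ResourcePool_null"));
    labels.put(NodeId.newInstance("host-10-19-92-188", 64318),
        Collections.singleton("ResourcePool_1"));
    // On recovery, the order in which these two entries are replayed decides
    // whether the host-level or the node-level label ends up applying to
    // host-10-19-92-188:64318, which is why the replay ordering matters.
    System.out.println(labels);
  }
}
{code}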




> NM to nodelabel mapping going wrong after RM restart
> 
>
> Key: YARN-4454
> URL: https://issues.apache.org/jira/browse/YARN-4454
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>
> *Steps to reproduce*
> 1.Create cluster with 2 NM
> 2.Add label X,Y to cluster
> 3.replace  Label of node  1 using ,x
> 4.replace label for node 1 by ,y
> 5.Again replace label of node 1 by ,x
> Check cluster label mapping HOSTNAME1 will be mapped with X 
> Now restart RM 2 times NODE LABEL mapping of HOSTNAME1:PORT changes to Y
> {noformat}
> 2015-12-14 17:17:54,901 INFO 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
> 

[jira] [Created] (YARN-4454) NM to nodelabel mapping going wrong after RM restart

2015-12-14 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4454:
--

 Summary: NM to nodelabel mapping going wrong after RM restart
 Key: YARN-4454
 URL: https://issues.apache.org/jira/browse/YARN-4454
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


*Steps to reproduce*

1. Create a cluster with 2 NMs
2. Add labels X,Y to the cluster
3. Replace the label of node 1 using ,x
4. Replace the label of node 1 by ,y
5. Again replace the label of node 1 by ,x

Check the cluster label mapping: HOSTNAME1 will be mapped with X 

Now restart the RM 2 times; the NODE LABEL mapping of HOSTNAME1:PORT changes to Y

{noformat}
2015-12-14 17:17:54,901 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 

[jira] [Updated] (YARN-4454) NM to nodelabel mapping going wrong after RM restart

2015-12-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4454:
---
Description: 
*Nodelabel mapping with the NodeManager goes wrong if a combination of hostname 
and then NodeId is used*

*Steps to reproduce*

1. Create a cluster with 2 NMs
2. Add labels X,Y to the cluster
3. Replace the label of node 1 using ,x
4. Replace the label of node 1 by ,y
5. Again replace the label of node 1 by ,x

Check the cluster label mapping: HOSTNAME1 will be mapped with X 

Now restart the RM 2 times; the NODE LABEL mapping of HOSTNAME1:PORT changes to Y

{noformat}
2015-12-14 17:17:54,901 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 

[jira] [Updated] (YARN-4454) NM to nodelabel mapping going wrong after RM restart

2015-12-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4454:
---
Description: 
*Nodelabel mapping with the NodeManager goes wrong if a combination of hostname 
and then NodeId is used to update the nodelabel mapping*

*Steps to reproduce*

1. Create a cluster with 2 NMs
2. Add labels X,Y to the cluster
3. Replace the label of node 1 using ,x
4. Replace the label of node 1 by ,y
5. Again replace the label of node 1 by ,x

Check the cluster label mapping: HOSTNAME1 will be mapped with X 

Now restart the RM 2 times; the NODE LABEL mapping of HOSTNAME1:PORT changes to Y

{noformat}
2015-12-14 17:17:54,901 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 

[jira] [Commented] (YARN-4463) Container launch failure when yarn.nodemanager.log-dirs directory path contains space

2015-12-17 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062262#comment-15062262
 ] 

Bibin A Chundatt commented on YARN-4463:


Hi [~rohithsharma], thank you for looking into the issue.

Escaping should work; I haven't tested it yet. 

{noformat}
-Dyarn.app.container.log.dir=/container\ 
logs/application_1450332140888_0007/container_1450332140888_0007_01_63 
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA 
-Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild  
attempt_1450332140888_0007_m_07_2 63 1>/container\ 
logs/application_1450332140888_0007/container_1450332140888_0007_01_63/stdout
 2>/container\ 
logs/application_1450332140888_0007/container_1450332140888_0007_01_63/stderr
{noformat}

As the paths above show, we could use "\ " in Unix when the path contains a space. 
Will upload a patch soon.
Any other solution?
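A small sketch of the two options being weighed here (quote the whole path, or escape each space) when composing the stdout/stderr redirects for launch_container.sh; the helper names and the sample directory are hypothetical, not the ContainerLaunch API:

{code}
public class LogDirEscapeSketch {
  // Option 1: wrap the whole path in double quotes, e.g. 1>"/container logs/.../stdout"
  static String quoted(String logDir) {
    return "1>\"" + logDir + "/stdout\" 2>\"" + logDir + "/stderr\"";
  }

  // Option 2: escape each space with a backslash, e.g. 1>/container\ logs/.../stdout
  static String escaped(String logDir) {
    String esc = logDir.replace(" ", "\\ ");
    return "1>" + esc + "/stdout 2>" + esc + "/stderr";
  }

  public static void main(String[] args) {
    // Hypothetical log dir with a space, mirroring the example above.
    String logDir = "/container logs/application_1450332140888_0007/container_01_63";
    System.out.println(quoted(logDir));
    System.out.println(escaped(logDir));
  }
}
{code}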

> Container launch failure when yarn.nodemanager.log-dirs directory path 
> contains space
> -
>
> Key: YARN-4463
> URL: https://issues.apache.org/jira/browse/YARN-4463
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> If the container log directory path contains space container-launch fails
> Even with DEBUG logs are enabled only log able to get is 
> {noformat}
> Container id: container_e32_1450233925719_0009_01_22
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:912)
> at org.apache.hadoop.util.Shell.run(Shell.java:823)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1102)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:304)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can support container-launch to support nmlog directory path with space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled

2015-12-17 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062269#comment-15062269
 ] 

Bibin A Chundatt commented on YARN-4465:


Hi [~leftnoteasy] / [~sunilg], any thoughts? 

The message is confusing: *resource request=3. Queue labels=1,3*; the request was 
for label 3 and the labels available on the queue are 1,3. 
> SchedulerUtils#validateRequest for Label check should happen only when 
> nodelabel enabled
> 
>
> Key: YARN-4465
> URL: https://issues.apache.org/jira/browse/YARN-4465
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Disable label from rm side yarn.nodelabel.enable=false
> Capacity scheduler label configuration for queue is available as below
> default label for queue = b1 as 3 and accessible labels as 1,3
> Submit application to queue A .
> {noformat}
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException):
>  Invalid resource request, queue=b1 doesn't have permission to access all 
> labels in resource request. labelExpression of resource request=3. Queue 
> labels=1,3
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247)
> {noformat}
> # Ignore default label expression when label is disabled *or*
> # NormalizeResourceRequest we can set label expression to  
> when node label is not enabled *or*
> # Improve message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4463) Container launch failure when yarn.nodemanager.log-dirs directory path contains space

2015-12-18 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063818#comment-15063818
 ] 

Bibin A Chundatt commented on YARN-4463:


Hi [~sunilg]
To explain the problem further:
In launch_container.sh we create the exec command for launching the YARN child; a space 
in the path causes unexpected results such as failure of the container launch.
URL encoding might not work here. Either the path should be in "" or, for the 
space, we should use an escape character. If we use *X%20* to append the out 
and err streams, it will create a file with the name *X%20*, not 
*X*, right?

> Container launch failure when yarn.nodemanager.log-dirs directory path 
> contains space
> -
>
> Key: YARN-4463
> URL: https://issues.apache.org/jira/browse/YARN-4463
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> If the container log directory path contains space container-launch fails
> Even with DEBUG logs are enabled only log able to get is 
> {noformat}
> Container id: container_e32_1450233925719_0009_01_22
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:912)
> at org.apache.hadoop.util.Shell.run(Shell.java:823)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1102)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:304)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can support container-launch to support nmlog directory path with space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled

2015-12-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4465:
--

 Summary: SchedulerUtils#validateRequest for Label check should 
happen only when nodelabel enabled
 Key: YARN-4465
 URL: https://issues.apache.org/jira/browse/YARN-4465
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Disable labels on the RM side: yarn.nodelabel.enable=false
The capacity scheduler label configuration for the queue is as below:
default label for queue b1 = 3 and accessible labels = 1,3
Submit an application to queue A.

{noformat}
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException):
 Invalid resource request, queue=b1 doesn't have permission to access all 
labels in resource request. labelExpression of resource request=3. Queue 
labels=1,3
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247)

{noformat}


# Ignore the default label expression when labels are disabled (a sketch of this option follows below) *or*
# In NormalizeResourceRequest we can set the label expression to  
when node labels are not enabled *or*
# Improve the message
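A minimal sketch of the first option, assuming the check is simply skipped when node labels are disabled on the RM; it is shown as a standalone guard rather than the actual change in SchedulerUtils#validateResourceRequest, and it assumes the {{yarn.node-labels.enabled}} constants in YarnConfiguration:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LabelValidationGuardSketch {
  // Returns true only when node labels are enabled; callers would skip the
  // queue-vs-request label validation (and the queue's default label
  // expression) entirely when this returns false.
  static boolean shouldValidateLabels(Configuration conf) {
    return conf.getBoolean(YarnConfiguration.NODE_LABELS_ENABLED,
        YarnConfiguration.DEFAULT_NODE_LABELS_ENABLED);
  }
}
{code}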



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4463) Container launch failure when yarn.nodemanager.log-dirs directory path contains space

2015-12-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4463:
---
Summary: Container launch failure when yarn.nodemanager.log-dirs directory 
path contains space  (was: Container launch failure when nm log directory path 
contains space)

> Container launch failure when yarn.nodemanager.log-dirs directory path 
> contains space
> -
>
> Key: YARN-4463
> URL: https://issues.apache.org/jira/browse/YARN-4463
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> If the container log directory path contains space container-launch fails
> Even with DEBUG logs are enabled only log able to get is 
> {noformat}
> Container id: container_e32_1450233925719_0009_01_22
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:912)
> at org.apache.hadoop.util.Shell.run(Shell.java:823)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1102)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:304)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can support container-launch to support nmlog directory path with space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4463) Container launch failure when nm log directory path container space

2015-12-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4463:
--

 Summary: Container launch failure when nm log directory path 
container space
 Key: YARN-4463
 URL: https://issues.apache.org/jira/browse/YARN-4463
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


If the container log directory path contains a space, container launch fails.

Even with DEBUG logs enabled, the only log available is: 

{noformat}
Container id: container_e32_1450233925719_0009_01_22
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:912)
at org.apache.hadoop.util.Shell.run(Shell.java:823)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1102)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:304)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{noformat}

We can make container launch support an NM log directory path that contains a space.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4463) Container launch failure when nm log directory path contains space

2015-12-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4463:
---
Summary: Container launch failure when nm log directory path contains space 
 (was: Container launch failure when nm log directory path container space)

> Container launch failure when nm log directory path contains space
> --
>
> Key: YARN-4463
> URL: https://issues.apache.org/jira/browse/YARN-4463
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> If the container log directory path contains space container-launch fails
> Even with DEBUG logs are enabled only log able to get is 
> {noformat}
> Container id: container_e32_1450233925719_0009_01_22
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:912)
> at org.apache.hadoop.util.Shell.run(Shell.java:823)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1102)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:304)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can support container-launch to support nmlog directory path with space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-04 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4538:
--

 Summary: QueueMetrics pending  cores and memory metrics wrong
 Key: YARN-4538
 URL: https://issues.apache.org/jira/browse/YARN-4538
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Submit 2 applications to the default queue 
Check the queue metrics for pending cores and memory

{noformat}
List<QueueInfo> allQueues = client.getChildQueueInfos("root");

for (QueueInfo queueInfo : allQueues) {
  QueueStatistics quastats = queueInfo.getQueueStatistics();
  System.out.println(quastats.getPendingVCores());
  System.out.println(quastats.getPendingMemoryMB());
}

{noformat}

*Output :*

-20
-20480




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4232) TopCLI console support for HA mode

2016-01-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4232:
---
Attachment: 0002-YARN-4232.patch

Attaching patch for review 
Cluster info proto added

> TopCLI console support for HA mode
> --
>
> Key: YARN-4232
> URL: https://issues.apache.org/jira/browse/YARN-4232
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4232.patch, 0002-YARN-4232.patch
>
>
> *Steps to reproduce*
> Start Top command in YARN in HA mode
> ./yarn top
> {noformat}
> usage: yarn top
>  -cols  Number of columns on the terminal
>  -delay The refresh delay(in seconds), default is 3 seconds
>  -help   Print usage; for help while the tool is running press 'h'
>  + Enter
>  -queuesComma separated list of queues to restrict applications
>  -rows  Number of rows on the terminal
>  -types Comma separated list of types to restrict applications,
>  case sensitive(though the display is lower case)
>  -users Comma separated list of users to restrict applications
> {noformat}
> Execute *for help while the tool is running press 'h'  + Enter* while top 
> tool is running
> Exception is thrown in console continuously
> {noformat}
> 15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
> at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-04 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082564#comment-15082564
 ] 

Bibin A Chundatt commented on YARN-4538:


[~sunilg]/[~rohithsharma]

Thank you for looking into the issue. I am not sure it's the same issue. 
Currently the issue looks to be in {{QueueMetrics#_decrPendingResources}} and 
{{QueueMetrics#_incrPendingResources}}

{noformat}
  private void _decrPendingResources(int containers, Resource res) {
    // if #container = 0, means change container resource
    pendingContainers.decr(containers);
    pendingMB.decr(res.getMemory() * Math.max(containers, 1));
    pendingVCores.decr(res.getVirtualCores() * Math.max(containers, 1));
  }

  private void _incrPendingResources(int containers, Resource res) {
    pendingContainers.incr(containers);
    pendingMB.incr(res.getMemory() * containers);
    pendingVCores.incr(res.getVirtualCores() * containers);
  }
{noformat}

For increase and decrease the logic looks different.

{noformat}
[{Priority: 20, Capability: , # Containers: 1, Location: 
*, Relax Locality: true, Node Label Expression: }]  
[{Priority: 20, Capability: , # Containers: 0, Location: 
*, Relax Locality: true, Node Label Expression: }]  
{noformat}

The resource request in {{AppSchedulingInfo#updateResourceRequests}} is as 
above, and it causes pendingMB and pendingVCores to go negative.
Whenever an allocation request contains a non-zero capability and a 
container count of 0, the pending resource calculation goes wrong.

Looks related to YARN-1651. 
I will make {{QueueMetrics#_incrPendingResources}} consistent with 
{{QueueMetrics#_decrPendingResources}} and upload a patch soon (see the sketch below).

Thoughts?
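A minimal sketch of the symmetric fix described above, mirroring the decrement fragment quoted earlier; it is a fragment in the same style as that quote, not a complete class:

{code}
  // Sketch only: align the increment path with _decrPendingResources by using
  // Math.max(containers, 1), so a request with a container count of 0 (a
  // container resource change) moves pendingMB/pendingVCores by the same
  // amount on both the increment and the decrement side.
  private void _incrPendingResources(int containers, Resource res) {
    pendingContainers.incr(containers);
    pendingMB.incr(res.getMemory() * Math.max(containers, 1));
    pendingVCores.incr(res.getVirtualCores() * Math.max(containers, 1));
  }
{code}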

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Submit 2 application to default queue 
> Check queue metrics for pending cores and memory
> {noformat}
> List allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4538:
---
Priority: Critical  (was: Major)

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>
> Submit 2 application to default queue 
> Check queue metrics for pending cores and memory
> {noformat}
> List allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-05 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4538:
---
Priority: Major  (was: Critical)

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Submit 2 application to default queue 
> Check queue metrics for pending cores and memory
> {noformat}
> List allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0003-YARN-3940.patch

[~leftnoteasy]

Attaching patch for review 
* The check that the destination labels are a subset of the source labels is done, and also the resource 
usage check (see the sketch below)
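A minimal sketch of the label check on move, under the assumption that the move is rejected unless one label set fully covers the other; which queue supplies which set is the patch's decision, and the names below are illustrative only:

{code}
import java.util.Set;

public class MoveQueueLabelCheckSketch {
  // Rejects the move unless every required label is present in the allowed set
  // (requiredLabels must be a subset of allowedLabels).
  static void checkLabelSubset(Set<String> requiredLabels, Set<String> allowedLabels) {
    if (!allowedLabels.containsAll(requiredLabels)) {
      throw new IllegalArgumentException("Move rejected: labels " + requiredLabels
          + " are not all accessible; allowed=" + allowedLabels);
    }
  }
}
{code}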

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch
>
>
> Configure capacity scheduler 
> Configure node label an submit application {{queue=A Label=X}}
> Move application to queue {{B}} and x is not having access
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> Same exception will be thrown till *heartbeat timeout*
> Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0004-YARN-3940.patch

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch
>
>
> Configure capacity scheduler 
> Configure node label an submit application {{queue=A Label=X}}
> Move application to queue {{B}} and x is not having access
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> Same exception will be thrown till *heartbeat timeout*
> Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-06 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085917#comment-15085917
 ] 

Bibin A Chundatt commented on YARN-3940:


Uploading the patch after fixing the checkstyle issue. Please review the latest patch

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch
>
>
> Configure capacity scheduler 
> Configure node label an submit application {{queue=A Label=X}}
> Move application to queue {{B}} and x is not having access
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> Same exception will be thrown till *heartbeat timeout*
> Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: (was: 0004-YARN-3940.patch)

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch
>
>
> Configure capacity scheduler 
> Configure node label an submit application {{queue=A Label=X}}
> Move application to queue {{B}} and x is not having access
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> Same exception will be thrown till *heartbeat timeout*
> Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0004-YARN-3940.patch

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch
>
>
> Configure capacity scheduler 
> Configure node label an submit application {{queue=A Label=X}}
> Move application to queue {{B}} and x is not having access
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> Same exception will be thrown till *heartbeat timeout*
> Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4232) TopCLI console support for HA mode

2016-01-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4232:
---
Attachment: (was: 0002-YARN-4232.patch)

> TopCLI console support for HA mode
> --
>
> Key: YARN-4232
> URL: https://issues.apache.org/jira/browse/YARN-4232
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4232.patch
>
>
> *Steps to reproduce*
> Start Top command in YARN in HA mode
> ./yarn top
> {noformat}
> usage: yarn top
>  -cols  Number of columns on the terminal
>  -delay The refresh delay(in seconds), default is 3 seconds
>  -help   Print usage; for help while the tool is running press 'h'
>  + Enter
>  -queuesComma separated list of queues to restrict applications
>  -rows  Number of rows on the terminal
>  -types Comma separated list of types to restrict applications,
>  case sensitive(though the display is lower case)
>  -users Comma separated list of users to restrict applications
> {noformat}
> Execute *for help while the tool is running press 'h'  + Enter* while top 
> tool is running
> Exception is thrown in console continuously
> {noformat}
> 15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
> at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4232) TopCLI console support for HA mode

2016-01-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4232:
---
Attachment: 0002-YARN-4232.patch

[~djp] /[~rohithsharma]

Attaching the patch again to trigger CI. Could you please review the attached patch?
# Cluster info proto added to get the cluster start time in the YARN CLI
# Earlier the HTTP REST API was used to get the RM start time, which fails in secure mode and in HA (see the sketch below)
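
For context, a minimal sketch of why a single REST endpoint breaks under HA (illustrative only, not the attached patch, which moves to a proto-based call): in HA mode each RM publishes its own webapp address keyed by rm-id, so a client that only reads the plain {{yarn.resourcemanager.webapp.address}} key can end up connecting to a standby or unreachable RM. The property names below are the standard YARN keys; the helper class and method are hypothetical.

{code}
// Illustrative only (not the attached patch): resolve every candidate RM
// webapp address from the standard YARN HA configuration. In HA mode the
// plain webapp address key is not enough; each RM has its own address
// keyed by rm-id, and the client has to try each (or use an RPC call that
// already knows the active RM, which is what the patch moves to).
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;

public class RmWebAddressResolver {

  public static List<String> candidateRmWebAddresses(Configuration conf) {
    List<String> addresses = new ArrayList<String>();
    if (!conf.getBoolean("yarn.resourcemanager.ha.enabled", false)) {
      // non-HA cluster: the single configured webapp address is sufficient
      addresses.add(conf.get("yarn.resourcemanager.webapp.address"));
      return addresses;
    }
    // HA cluster: one webapp address per configured rm-id
    for (String rmId : conf.getTrimmedStrings("yarn.resourcemanager.ha.rm-ids")) {
      addresses.add(conf.get("yarn.resourcemanager.webapp.address." + rmId));
    }
    return addresses;
  }
}
{code}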

> TopCLI console support for HA mode
> --
>
> Key: YARN-4232
> URL: https://issues.apache.org/jira/browse/YARN-4232
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4232.patch, 0002-YARN-4232.patch
>
>
> *Steps to reproduce*
> Start Top command in YARN in HA mode
> ./yarn top
> {noformat}
> usage: yarn top
>  -cols  Number of columns on the terminal
>  -delay The refresh delay(in seconds), default is 3 seconds
>  -help   Print usage; for help while the tool is running press 'h'
>  + Enter
>  -queuesComma separated list of queues to restrict applications
>  -rows  Number of rows on the terminal
>  -types Comma separated list of types to restrict applications,
>  case sensitive(though the display is lower case)
>  -users Comma separated list of users to restrict applications
> {noformat}
> Execute the help action (*press 'h' + Enter*, per the usage text) while the 
> top tool is running
> An exception is thrown in the console continuously
> {noformat}
> 15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
> at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085011#comment-15085011
 ] 

Bibin A Chundatt commented on YARN-4538:


The test failures are not related to the attached patch.
All CI failures are already tracked as part of YARN-4478.

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4538.patch
>
>
> Submit 2 applications to the default queue 
> Check the queue metrics for pending cores and memory
> {noformat}
> List<QueueInfo> allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-05 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4538:
---
Attachment: 0001-YARN-4538.patch

Thanks for looking into the issue, [~leftnoteasy] 

* Added logic to skip the pending metrics update when containers <= 0 (see the sketch below)
* Updated the patch for the issue. Please review.
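
For reference, a minimal sketch of the guard (hypothetical class and fields, not the actual QueueMetrics code): the pending update is skipped entirely when the container count is not positive, so the pending gauges cannot be driven negative.

{code}
// Hypothetical sketch of the guard, not the actual QueueMetrics code:
// the pending update is applied only for a positive container count,
// which keeps the pending gauges from going negative.
import org.apache.hadoop.yarn.api.records.Resource;

public final class PendingMetricsGuard {

  private long pendingContainers;
  private long pendingMB;
  private long pendingVCores;

  /** Account pending resources only when the container count is positive. */
  public synchronized void incrPending(int containers, Resource perContainer) {
    if (containers <= 0 || perContainer == null) {
      return; // nothing to account for; skipping avoids negative metrics
    }
    pendingContainers += containers;
    pendingMB += (long) containers * perContainer.getMemory();
    pendingVCores += (long) containers * perContainer.getVirtualCores();
  }
}
{code}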

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4538.patch
>
>
> Submit 2 applications to the default queue 
> Check the queue metrics for pending cores and memory
> {noformat}
> List<QueueInfo> allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed

2016-01-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084562#comment-15084562
 ] 

Bibin A Chundatt commented on YARN-4539:


[~tangshangwen], can we close this JIRA?

> CommonNodeLabelsManager throw NullPointerException when the fairScheduler 
> init failed
> -
>
> Key: YARN-4539
> URL: https://issues.apache.org/jira/browse/YARN-4539
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: tangshangwen
>Assignee: tangshangwen
> Attachments: YARN-4539.1.patch
>
>
> When the scheduler initialization fails and the RM stops the CompositeService, 
> CommonNodeLabelsManager throws a NullPointerException.
> {noformat}
> 2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 
> failed in state INITED; cause: java.io.IOException: Failed to initialize 
> FairScheduler
> java.io.IOException: Failed to initialize FairScheduler
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 
> 2016-01-04 22:19:52,193 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager 
> metrics system...
> 2016-01-04 22:19:52,194 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics 
> system stopped.
> 2016-01-04 22:19:52,194 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics 
> system shutdown complete.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> AsyncDispatcher is draining to stop, igonring any new events.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
> state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
>   at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>   at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-06 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086837#comment-15086837
 ] 

Bibin A Chundatt commented on YARN-4538:


Hi [~leftnoteasy]

Thank you for looking into the patch.
* The pending resource update was missed for container increase/decrease.
* Updated the patch for the same based on Choice#2 (see the sketch below).

Please review.
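
A rough sketch of the delta handling for container increase/decrease (hypothetical helper; {{Resources.subtract}} is the standard YARN resource utility, everything else here is illustrative): the pending metrics should move by the difference between the target and current allocation rather than by the full target resource.

{code}
// Hypothetical helper (not the patch): on a container increase/decrease the
// pending metrics should change by the delta between target and current
// allocation. Resources.subtract is the standard YARN resource utility.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public final class ContainerResizePendingDelta {

  /** Positive components mean more pending; negative components mean released. */
  public static Resource pendingDelta(Resource current, Resource target) {
    return Resources.subtract(target, current);
  }

  public static void main(String[] args) {
    Resource current = Resource.newInstance(1024, 1);  // 1 GB, 1 vcore
    Resource target = Resource.newInstance(2048, 2);   // grow to 2 GB, 2 vcores
    System.out.println("Pending delta: " + pendingDelta(current, target));
  }
}
{code}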


> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4538.patch, 0002-YARN-4538.patch
>
>
> Submit 2 applications to the default queue 
> Check the queue metrics for pending cores and memory
> {noformat}
> List<QueueInfo> allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4538:
---
Attachment: 0002-YARN-4538.patch

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4538.patch, 0002-YARN-4538.patch
>
>
> Submit 2 applications to the default queue 
> Check the queue metrics for pending cores and memory
> {noformat}
> List<QueueInfo> allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4538) QueueMetrics pending cores and memory metrics wrong

2016-01-07 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4538:
---
Attachment: 0003-YARN-4538.patch

Attaching patch after checkstyle fix

> QueueMetrics pending  cores and memory metrics wrong
> 
>
> Key: YARN-4538
> URL: https://issues.apache.org/jira/browse/YARN-4538
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4538.patch, 0002-YARN-4538.patch, 
> 0003-YARN-4538.patch
>
>
> Submit 2 applications to the default queue 
> Check the queue metrics for pending cores and memory
> {noformat}
> List<QueueInfo> allQueues = client.getChildQueueInfos("root");
> for (QueueInfo queueInfo : allQueues) {
>   QueueStatistics quastats = queueInfo.getQueueStatistics();
>   System.out.println(quastats.getPendingVCores());
>   System.out.println(quastats.getPendingMemoryMB());
> }
> {noformat}
> *Output :*
> -20
> -20480



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-07 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0005-YARN-3940.patch

Thank you for looking into the patch, [~sunilg].

I have updated the patch based on the comments.

Point#3: I agree that the best approach is to store the labels used by the application. Currently that API is not available, so resource usage was the next choice. Is that change required for the move-queue case? I will wait for comments from [~leftnoteasy] too.

All other comments have been addressed. Attaching the patch for review (a rough sketch of the label check follows below).
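
For discussion, a hedged sketch of the check in question (method name and inputs are assumptions, not the patch): before the move, every label used by the application must be accessible from the target queue, with the wildcard label ({{*}}, i.e. RMNodeLabelsManager.ANY) accepting anything.

{code}
// Hedged sketch of the discussed check (method and inputs are assumptions):
// the move is allowed only if the target queue can access every label the
// application uses. "*" is the wildcard label (RMNodeLabelsManager.ANY).
import java.util.Set;

public final class MoveQueueLabelCheck {

  private static final String ANY = "*";

  public static boolean targetQueueCanAccess(Set<String> appUsedLabels,
      Set<String> targetQueueAccessibleLabels) {
    if (targetQueueAccessibleLabels.contains(ANY)) {
      return true; // wildcard: the target queue accepts any label
    }
    for (String label : appUsedLabels) {
      // assumption of this sketch: the default (empty) partition is accessible
      if (!label.isEmpty() && !targetQueueAccessibleLabels.contains(label)) {
        return false;
      }
    }
    return true;
  }
}
{code}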

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch, 0005-YARN-3940.patch
>
>
> Configure the capacity scheduler 
> Configure node labels and submit an application with {{queue=A Label=X}}
> Move the application to queue {{B}}, where label x is not accessible
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> The same exception will be thrown until the *heartbeat timeout*.
> Then the application state will be updated to *FAILED*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission

2016-01-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088779#comment-15088779
 ] 

Bibin A Chundatt commented on YARN-3940:


Hi [~Naganarasimha]
Thank you for the review. I have already taken care of that too:
{noformat}
   || targetqueuelabels.contains(RMNodeLabelsManager.ANY)) 
{noformat}

> Application moveToQueue should check NodeLabel permission 
> --
>
> Key: YARN-3940
> URL: https://issues.apache.org/jira/browse/YARN-3940
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch, 
> 0003-YARN-3940.patch, 0004-YARN-3940.patch, 0005-YARN-3940.patch
>
>
> Configure the capacity scheduler 
> Configure node labels and submit an application with {{queue=A Label=X}}
> Move the application to queue {{B}}, where label x is not accessible
> {code}
> 2015-07-20 19:46:19,626 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1437385548409_0005_01 released container 
> container_e08_1437385548409_0005_01_02 on node: host: 
> host-10-19-92-117:64318 #containers=1 available= 
> used= with event: KILL
> 2015-07-20 19:46:20,970 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Invalid resource ask by application appattempt_1437385548409_0005_01
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=x. Queue labels=y
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> {code}
> The same exception will be thrown until the *heartbeat timeout*.
> Then the application state will be updated to *FAILED*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

