[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle to prevent hot accessing nodes.

2021-04-15 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321928#comment-17321928
 ] 

Bibin Chundatt commented on YARN-10738:
---

[~zhuqi] I think this should be part of the *MultiNodeLookupPolicy*
implementation.


> When multi thread scheduling with multi node, we should shuffle to prevent 
> hot accessing nodes.
> ---
>
> Key: YARN-10738
> URL: https://issues.apache.org/jira/browse/YARN-10738
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multi-threaded scheduling with multi-node placement currently does not behave
> reasonably.
> In large clusters it causes hot-accessed nodes, which can leave individual
> nodes abnormally overloaded.
> Solution:
> I think we should shuffle the sorted nodes (for example, those ordered by the
> available-resource sort policy) at an interval.
> This would solve the above problem and avoid hot-accessed nodes.
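
The shuffling idea described above can be sketched roughly as follows. This is an illustration only, not the YARN MultiNodeLookupPolicy API: the class ShuffledNodeOrder, the Node record, and the window parameter are hypothetical names, and reading "interval" as a window of adjacent positions in the sorted list is an assumption (the issue may equally mean a time interval between shuffles).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch; not part of the YARN code base. */
final class ShuffledNodeOrder {

  /** Hypothetical node record: a name and its available memory in MB. */
  static final class Node {
    final String name;
    final long availableMb;

    Node(String name, long availableMb) {
      this.name = name;
      this.availableMb = availableMb;
    }
  }

  /**
   * Sorts a copy of the nodes by available memory (largest first), then
   * shuffles each consecutive group of {@code window} nodes, so that
   * concurrent scheduler threads are less likely to start from the same node.
   */
  static List<Node> sortThenShuffle(List<Node> nodes, int window, Random rng) {
    if (window <= 0) {
      throw new IllegalArgumentException("window must be positive");
    }
    List<Node> ordered = new ArrayList<>(nodes);
    ordered.sort(Comparator.comparingLong((Node n) -> n.availableMb).reversed());
    for (int start = 0; start < ordered.size(); start += window) {
      int end = Math.min(start + window, ordered.size());
      // subList is a view, so shuffling it reorders this slice of 'ordered'.
      Collections.shuffle(ordered.subList(start, end), rng);
    }
    return ordered;
  }
}
```

With a small window the overall resource ordering is mostly preserved, while the exact head of the list varies from one call to the next.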



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10649) Fix RMNodeImpl.updateExistContainers leak

2021-03-04 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10649:
--
Fix Version/s: 3.3.1

> Fix RMNodeImpl.updateExistContainers leak
> -
>
> Key: YARN-10649
> URL: https://issues.apache.org/jira/browse/YARN-10649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.1
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The patch for YARN-5168 added RMNodeImpl.updatedExistContainers, but it did not
> remove completed containers.
> These objects (ContainerStatusPBImpl and ContainerIdPBImpl) stay in
> RMNodeImpl.updatedExistContainers forever.
> Because of this leak, the ResourceManager in our production environment
> ran out of memory (OOM). We found 13 million ContainerStatusPBImpl objects in
> the ResourceManager heap dump.
> The patch has been applied in our production environment and so far it works well.
>  
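
The general shape of such a leak and its fix can be sketched as below. The actual patch is not shown in this thread, so this is an assumption about the pattern rather than the RMNodeImpl code: ContainerTracker, its State enum, and the String container IDs are hypothetical stand-ins.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-node container tracker. */
final class ContainerTracker {

  enum State { RUNNING, COMPLETE }

  // Named after the field in the issue; the real field holds protobuf-backed objects.
  private final Map<String, State> updatedExistContainers = new ConcurrentHashMap<>();

  /**
   * Records a status update. Completed containers are evicted rather than
   * stored, so the map cannot grow without bound.
   */
  void onStatus(String containerId, State state) {
    if (state == State.COMPLETE) {
      updatedExistContainers.remove(containerId);
    } else {
      updatedExistContainers.put(containerId, state);
    }
  }

  int trackedCount() {
    return updatedExistContainers.size();
  }
}
```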






[jira] [Comment Edited] (YARN-10649) Fix RMNodeImpl.updateExistContainers leak

2021-03-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295251#comment-17295251
 ] 

Bibin Chundatt edited comment on YARN-10649 at 3/5/21, 5:25 AM:


Thank you [~max2049] for the contribution. Committed to trunk and branch-3.3.


was (Author: bibinchundatt):
Thank you [~max2049] for contribution.. Committed to trunk

> Fix RMNodeImpl.updateExistContainers leak
> -
>
> Key: YARN-10649
> URL: https://issues.apache.org/jira/browse/YARN-10649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.1
>
>  Time Spent: 2h
>  Remaining Estimate: 0h






[jira] [Resolved] (YARN-10649) Fix RMNodeImpl.updateExistContainers leak

2021-03-04 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10649.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Thank you [~max2049] for the contribution. Committed to trunk.

> Fix RMNodeImpl.updateExistContainers leak
> -
>
> Key: YARN-10649
> URL: https://issues.apache.org/jira/browse/YARN-10649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h






[jira] [Updated] (YARN-10649) Fix RMNodeImpl.updateExistContainers leak

2021-03-04 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10649:
--
Summary: Fix RMNodeImpl.updateExistContainers leak  (was: ContainerIdPBImpl 
& ContainerStatusPBImpl objects can be leaked in 
RMNodeImpl.updatedExistContainers)

> Fix RMNodeImpl.updateExistContainers leak
> -
>
> Key: YARN-10649
> URL: https://issues.apache.org/jira/browse/YARN-10649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-02-02 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277661#comment-17277661
 ] 

Bibin Chundatt commented on YARN-10352:
---

Committed YARN-10352-010.patch to trunk.
Thank you [~zhuqi] and [~prabhujoseph] for the contribution and [~ztang] for
the review.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM does not unregister it from the
> RM. The RM's active nodes therefore still include the stopped nodes until the
> NM liveness monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10
> minutes, Multi Node Placement still assigns containers to those nodes. It needs
> to exclude nodes that have not heartbeated within the configured heartbeat
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms),
> similar to the asynchronous Capacity Scheduler threads
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement
> (yarn.scheduler.capacity.multi-node-placement-enabled) and Node Recovery
> (yarn.node.recovery.enabled).
> 2. Run only one NM, say worker0.
> 3. Stop worker0 and start another NM, say worker1.
> 4. Submit a sleep job. The containers will time out because they are assigned
> to the stopped NM worker0.
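
The repro above can be modelled with a small filter over last-heartbeat timestamps. This is a sketch under stated assumptions, not the committed patch: HeartbeatFilter and liveCandidates are hypothetical names, and the tolerance of two heartbeat intervals is borrowed from the YARN-8557 description rather than from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of skipping nodes whose heartbeat is stale. */
final class HeartbeatFilter {

  /**
   * Returns the node names whose last heartbeat is at most two heartbeat
   * intervals old. The factor of two is an assumption for illustration.
   */
  static List<String> liveCandidates(
      Map<String, Long> lastHeartbeatMs, long nowMs, long heartbeatIntervalMs) {
    List<String> live = new ArrayList<>();
    for (Map.Entry<String, Long> entry : lastHeartbeatMs.entrySet()) {
      long ageMs = nowMs - entry.getValue();
      if (ageMs <= 2 * heartbeatIntervalMs) {
        live.add(entry.getKey());
      }
    }
    return live;
  }
}
```

In the scenario from the repro, worker0 stopped heartbeating long ago and is filtered out, so only worker1 remains a placement candidate even though the liveness monitor has not yet expired worker0.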






[jira] [Comment Edited] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-01-22 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270068#comment-17270068
 ] 

Bibin Chundatt edited comment on YARN-10352 at 1/22/21, 11:13 AM:
--

+1, looks good to me. Let's wait a few days for a review from [~ztang].


was (Author: bibinchundatt):
+1 look good to me . Wait for few days for review from [~ztang]

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-01-22 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270068#comment-17270068
 ] 

Bibin Chundatt commented on YARN-10352:
---

+1, looks good to me. Let's wait a few days for a review from [~ztang].

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch






[jira] [Updated] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2021-01-21 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10519:
--
Fix Version/s: 3.3.1

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch, YARN-10519.v4.patch, YARN-10519.v5.patch, 
> YARN-10519.v6.patch, YARN-10519.v7.patch
>
>
> Refactor the code for QueueMetricsForCustomResources to move the base classes
> to the yarn-common package. This allows the class to be reused when adding
> custom resource types at the NM level as well.






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2021-01-21 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269770#comment-17269770
 ] 

Bibin Chundatt commented on YARN-10519:
---

Cherry-picked to branch-3.3.

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch, YARN-10519.v4.patch, YARN-10519.v5.patch, 
> YARN-10519.v6.patch, YARN-10519.v7.patch






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2021-01-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269067#comment-17269067
 ] 

Bibin Chundatt commented on YARN-10519:
---

The test case failures are unrelated to the attached patch, and the tests pass locally.

Thank you [~minni31] for the contribution. Committed to trunk.

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch, YARN-10519.v4.patch, YARN-10519.v5.patch, 
> YARN-10519.v6.patch, YARN-10519.v7.patch






[jira] [Comment Edited] (YARN-10572) Merge YARN-8557 and YARN-10352, and rebase based YARN-10380.

2021-01-19 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268380#comment-17268380
 ] 

Bibin Chundatt edited comment on YARN-10572 at 1/20/21, 6:05 AM:
-

YARN-10352 is close to completion, so it is better to just rebase YARN-10352.
Let's close this and rebase the patch in YARN-10352.


was (Author: bibinchundatt):
YARN-10352 is almost close to completion only rebase it .. Lets close this and 
rebase patch in YARN-10352

> Merge YARN-8557 and YARN-10352, and rebase based YARN-10380.
> 
>
> Key: YARN-10572
> URL: https://issues.apache.org/jira/browse/YARN-10572
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10572.001.patch
>
>
> The work is:
> 1. Because of YARN-10380, we should rebase YARN-10352.
> 2. Also merge YARN-8557 to skip nodes that are not running.
> 3. Refactor some methods in YARN-10380.






[jira] [Commented] (YARN-10572) Merge YARN-8557 and YARN-10352, and rebase based YARN-10380.

2021-01-19 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268380#comment-17268380
 ] 

Bibin Chundatt commented on YARN-10572:
---

YARN-10352 is close to completion and only needs a rebase. Let's close this and
rebase the patch in YARN-10352.

> Merge YARN-8557 and YARN-10352, and rebase based YARN-10380.
> 
>
> Key: YARN-10572
> URL: https://issues.apache.org/jira/browse/YARN-10572
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10572.001.patch






[jira] [Commented] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2021-01-13 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264006#comment-17264006
 ] 

Bibin Chundatt commented on YARN-8557:
--

[~zhuqi] could you help out with rebasing YARN-10352?

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently only heartbeat-lagged (HB-lagged) nodes are handled, with a
> hard-coded threshold of 2 times the HB interval, which we should make
> configurable. Moreover, we need to exclude unhealthy and decommissioned nodes too.
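
A combined check along these lines might look like the sketch below. It is illustrative only: NodeEligibility, its NodeState enum, and the lagMultiplier parameter are hypothetical names, and how the multiplier would be read from configuration is left out.

```java
/** Hypothetical sketch of the exclusion rule proposed in this issue. */
final class NodeEligibility {

  /** Simplified node states; the real RM tracks more states than these. */
  enum NodeState { RUNNING, UNHEALTHY, DECOMMISSIONED }

  /**
   * Returns true when the async allocation thread should skip the node:
   * either it is not in the RUNNING state, or its last heartbeat is older
   * than lagMultiplier heartbeat intervals.
   */
  static boolean shouldSkip(
      NodeState state,
      long nowMs,
      long lastHeartbeatMs,
      long heartbeatIntervalMs,
      int lagMultiplier) {
    if (state != NodeState.RUNNING) {
      return true;
    }
    return nowMs - lastHeartbeatMs > heartbeatIntervalMs * lagMultiplier;
  }
}
```

Passing the multiplier as a parameter instead of hard-coding 2 is what makes the threshold tunable per cluster.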






[jira] [Commented] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2021-01-11 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262532#comment-17262532
 ] 

Bibin Chundatt commented on YARN-8529:
--

+1

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v10.patch, 
> YARN-8529.v11.patch, YARN-8529.v2.patch, YARN-8529.v3.patch, 
> YARN-8529.v4.patch, YARN-8529.v5.patch, YARN-8529.v6.patch, 
> YARN-8529.v7.patch, YARN-8529.v8.patch, YARN-8529.v9.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.
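
Making such a timeout configurable usually means reading it from configuration with a default. The sketch below shows that pattern with a plain map instead of Hadoop's Configuration class; the key name, the default value, and the class RouterTimeoutConfig are all hypothetical and are not the names used by the patch.

```java
import java.util.Map;

/** Hypothetical sketch of reading a timeout from configuration. */
final class RouterTimeoutConfig {

  // Hypothetical key and default, chosen only for this example.
  static final String TIMEOUT_KEY = "example.router.webapp.timeout-ms";
  static final long DEFAULT_TIMEOUT_MS = 30_000L;

  /** Returns the configured timeout in milliseconds, or the default if unset. */
  static long timeoutMs(Map<String, String> conf) {
    String raw = conf.get(TIMEOUT_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_TIMEOUT_MS;
    }
    long value = Long.parseLong(raw.trim());
    if (value <= 0) {
      throw new IllegalArgumentException(TIMEOUT_KEY + " must be positive: " + value);
    }
    return value;
  }
}
```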






[jira] [Resolved] (YARN-10538) Add recommissioning nodes to the list of updated nodes returned to the AM

2021-01-08 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10538.
---
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Committed to trunk and branch-3.3

> Add recommissioning nodes to the list of updated nodes returned to the AM
> -
>
> Key: YARN-10538
> URL: https://issues.apache.org/jira/browse/YARN-10538
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.1, 3.1.1
>Reporter: Srinivas S T
>Assignee: Srinivas S T
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> YARN-6483 added nodes that transition to the DECOMMISSIONING state to the
> list of updated nodes returned to the AM. This allows the Spark application
> master to gracefully decommission its containers on the decommissioning node.
> But if the node were to be recommissioned, the Spark application master would
> not be aware of this. We propose to add recommissioned nodes to the list of
> updated nodes sent to the AM when a recommission node transition occurs.
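
The proposal can be pictured as one extra case in the bookkeeping of updated nodes. The sketch below is an assumption about the idea, not the RM implementation: UpdatedNodesTracker, its two-state enum, and the string encoding of updates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reporting node transitions to the AM. */
final class UpdatedNodesTracker {

  /** Simplified node states for this example. */
  enum State { RUNNING, DECOMMISSIONING }

  private final List<String> updatedNodes = new ArrayList<>();

  /** Records transitions the AM should hear about, including recommissioning. */
  void onTransition(String nodeId, State from, State to) {
    boolean decommissioning = from != State.DECOMMISSIONING && to == State.DECOMMISSIONING;
    boolean recommissioned = from == State.DECOMMISSIONING && to == State.RUNNING;
    if (decommissioning || recommissioned) {
      updatedNodes.add(nodeId + ":" + to);
    }
  }

  /** Returns the pending updates and clears them, as one AM response would. */
  List<String> drainUpdatedNodes() {
    List<String> out = new ArrayList<>(updatedNodes);
    updatedNodes.clear();
    return out;
  }
}
```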






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-01-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259555#comment-17259555
 ] 

Bibin Chundatt commented on YARN-10352:
---

[~prabhujoseph] I missed committing this patch. Could you rebase it?

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch






[jira] [Commented] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2021-01-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259546#comment-17259546
 ] 

Bibin Chundatt commented on YARN-8557:
--

[~zhuqi] YARN-10352 already exists for a similar implementation, and we missed
committing it.

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2021-01-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258062#comment-17258062
 ] 

Bibin Chundatt commented on YARN-10519:
---

+1, looks good to me.

[~sunil.gov...@gmail.com] could you take a look too?

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch, YARN-10519.v4.patch, YARN-10519.v5.patch, 
> YARN-10519.v6.patch, YARN-10519.v7.patch






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2020-12-18 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251692#comment-17251692
 ] 

Bibin Chundatt commented on YARN-10519:
---

Overall the patch looks good. A few minor comments:

* Update the javadoc for registerCustomResources and add javadoc for the
 CustomResourceMetrics metrics.



> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch, YARN-10519.v4.patch, YARN-10519.v5.patch






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2020-12-13 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248744#comment-17248744
 ] 

Bibin Chundatt commented on YARN-10519:
---

A few minor comments:

# Change the visibility of queueMetricsForCustomResources to protected in
QueueMetrics and avoid the extra reference in CSQueueMetrics:
{code}
this.csQueueMetricsForCustomResources =
    new CSQueueMetricsForCustomResources();
setQueueMetricsForCustomResources(csQueueMetricsForCustomResources);
{code}
# Remove the added blank line:
{code}

private void incrementPendingResources(int containers, Resource res) {
{code}

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10519.v1.patch, YARN-10519.v2.patch, 
> YARN-10519.v3.patch






[jira] [Commented] (YARN-10519) Refactor QueueMetricsForCustomResources class to move to yarn-common package

2020-12-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245699#comment-17245699
 ] 

Bibin Chundatt commented on YARN-10519:
---

Thank you [~minni31] for working on this.

# Reserved and pending are applicable only to queues. The common class could
cover allocated and available.

> Refactor QueueMetricsForCustomResources class to move to yarn-common package
> 
>
> Key: YARN-10519
> URL: https://issues.apache.org/jira/browse/YARN-10519
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10519.v1.patch
>
>
> Refactor the code for QueueMetricsForCustomResources to move the base classes 
> to yarn-common package. This helps in reusing the class in adding custom 
> resource types at NM level also. 






[jira] [Commented] (YARN-10476) Queue metrics for Unmanaged applications

2020-11-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226722#comment-17226722
 ] 

Bibin Chundatt commented on YARN-10476:
---

Thank you [~cyrusjackson25].
# Instead of creating separate methods for move, finish, etc., we could just 
have a boolean to indicate the type, right?
# Could you add a test to verify the recovery side as well?

>  Queue metrics for Unmanaged applications
> -
>
> Key: YARN-10476
> URL: https://issues.apache.org/jira/browse/YARN-10476
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Cyrus Jackson
>Assignee: Cyrus Jackson
>Priority: Minor
> Attachments: YARN-10476.001.patch, YARN-10476.002.patch, 
> YARN-10476.003.patch
>
>
> Right now we do not have separate metrics unmanaged applications. All 
> application metrics come as part of Queue (Managed and UnManaged), This Jira 
> aims to show them separately.






[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-11-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224437#comment-17224437
 ] 

Bibin Chundatt commented on YARN-10475:
---

Sure, let's have a follow-up JIRA to work on making this pluggable.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, it's heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> it's heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.






[jira] [Comment Edited] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223491#comment-17223491
 ] 

Bibin Chundatt edited comment on YARN-10475 at 10/30/20, 8:36 AM:
--

Thank you [~Jim_Brennan] for working on this.

Could you make the implementation generic so that other policies can be plugged 
in too? CPU utilization could be one of the policies that helps in deciding the 
HB interval. Thoughts?
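
A minimal sketch of what such a plug-in point might look like. The interface name HeartbeatIntervalPolicy, the class CpuUtilizationPolicy, and the scaling rule (base interval multiplied by the node-to-cluster CPU ratio, then clamped) are all hypothetical illustrations, not what the patch implements:

```java
/** Hypothetical plug-in point: maps utilization figures to a heartbeat interval. */
interface HeartbeatIntervalPolicy {
  long nextIntervalMs(double nodeCpuUtil, double clusterCpuUtil);
}

/** One possible policy (an assumption): scale a base interval by relative CPU use. */
final class CpuUtilizationPolicy implements HeartbeatIntervalPolicy {
  private final long baseMs;
  private final long minMs;
  private final long maxMs;

  CpuUtilizationPolicy(long baseMs, long minMs, long maxMs) {
    this.baseMs = baseMs;
    this.minMs = minMs;
    this.maxMs = maxMs;
  }

  @Override
  public long nextIntervalMs(double nodeCpuUtil, double clusterCpuUtil) {
    if (clusterCpuUtil <= 0.0) {
      // Nothing to compare against, so fall back to the base interval.
      return baseMs;
    }
    // A ratio above 1 means the node is busier than the cluster average,
    // so its heartbeat slows down; below 1 it speeds up.
    double ratio = nodeCpuUtil / clusterCpuUtil;
    long scaled = Math.round(baseMs * ratio);
    // Clamp so the interval never leaves the configured bounds.
    return Math.max(minMs, Math.min(maxMs, scaled));
  }
}
```

With an interface like this, a policy based on memory or node health could be swapped in without touching the heartbeat handling code.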



was (Author: bibinchundatt):
Thank you  [~Jim_Brennan]  working on this.  

Could you make the implementation generics to plugin other policies too. Cpu 
utlization only of the policy which helps in deciding the HB interval. thoughts?


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, it's heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> it's heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.






[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223491#comment-17223491
 ] 

Bibin Chundatt commented on YARN-10475:
---

Thank you [~Jim_Brennan] for working on this.

Could you make the implementation generic so that other policies can be plugged 
in too? CPU utilization is only one of the policies that helps in deciding the 
HB interval. Thoughts?


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, it's heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> it's heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.






[jira] [Reopened] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-31 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10395:
---

Reopening since it's not committed to any branch.

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172949#comment-17172949
 ] 

Bibin Chundatt commented on YARN-10352:
---

+1 for the latest patch. I will commit it by EOD if there are no objections.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.






[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition

2020-08-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172128#comment-17172128
 ] 

Bibin Chundatt commented on YARN-10388:
---

Overall, the patch looks good to me. +1. Waiting for the Jenkins result.
cc: [~inigoiri] 

> RMNode updatedCapability flag not set while RecommissionNodeTransition
> --
>
> Key: YARN-10388
> URL: https://issues.apache.org/jira/browse/YARN-10388
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Major
> Attachments: YARN-10388.001.patch
>
>
> RMNode updatedCapability flag is not set while RecommissionNodeTransition 
> happens. RM gets updated of new totalcapability when recommissioning of node 
> happens. But the nodemanager still has old totalcapability and is not aware 
> of the change. Setting this flag while RecommissionNodeTransition  would 
> update nodemanager of totalcapability change as well.






[jira] [Commented] (YARN-10388) RMNode updatedCapability flag not set while RecommissionNodeTransition

2020-08-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172043#comment-17172043
 ] 

Bibin Chundatt commented on YARN-10388:
---

Good catch, [~lapjarn].

> RMNode updatedCapability flag not set while RecommissionNodeTransition
> --
>
> Key: YARN-10388
> URL: https://issues.apache.org/jira/browse/YARN-10388
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Major
>
> RMNode updatedCapability flag is not set while RecommissionNodeTransition 
> happens. RM gets updated of new totalcapability when recommissioning of node 
> happens. But the nodemanager still has old totalcapability and is not aware 
> of the change. Setting this flag while RecommissionNodeTransition  would 
> update nodemanager of totalcapability change as well.






[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-08-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172042#comment-17172042
 ] 

Bibin Chundatt commented on YARN-10335:
---

[~sunilg] Could you take a look?

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch, YARN-10335.002.patch, 
> YARN-10335.003.patch, YARN-10335.004.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170656#comment-17170656
 ] 

Bibin Chundatt commented on YARN-10352:
---

Thank you [~prabhujoseph] for the patch.

 Just a few queries / comments:
 # How much improvement does the custom iterator give over 
*Iterators.filter*?
 # Can we avoid the hardcoded multiplier (* 2) and make it configurable? The 
multiplier could go wrong when the dispatcher is overloaded. Processing events 
for large clusters can be slow, and events could stay in the async dispatcher 
for more than 2 seconds.
 # In MultiNodeSortingManager, the imports could be ordered.
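
To illustrate the second point, here is a rough sketch of a staleness check where the multiplier is a parameter instead of a constant. The class HeartbeatFilterSketch and its method names are made up for this example and are not taken from the patch; node state is reduced to a last-heartbeat timestamp:

```java
import java.util.List;
import java.util.stream.Collectors;

final class HeartbeatFilterSketch {
  /** True when the node has been silent for longer than interval * multiplier. */
  static boolean shouldSkip(long nowMs, long lastHeartbeatMs,
      long heartbeatIntervalMs, int multiplier) {
    return nowMs - lastHeartbeatMs > heartbeatIntervalMs * multiplier;
  }

  /** Keeps only the heartbeat timestamps that still count as fresh. */
  static List<Long> freshHeartbeats(List<Long> lastHeartbeats, long nowMs,
      long heartbeatIntervalMs, int multiplier) {
    return lastHeartbeats.stream()
        .filter(t -> !shouldSkip(nowMs, t, heartbeatIntervalMs, multiplier))
        .collect(Collectors.toList());
  }
}
```

With a 1000 ms interval, a node whose last heartbeat was processed 2500 ms ago is skipped with a multiplier of 2 but kept with a multiplier of 3, which is the overloaded-dispatcher case described above.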

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.






[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-08-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169246#comment-17169246
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you [~cyrusjackson25] for the patch.

I just had a brief look at the patch. A few comments:

# This could be assigned to *NodeHealthCheckerService*:
{noformat}
NodeHealthCheckerServiceImpl healthChecker =
    createNodeHealthCheckerService();
{noformat}
# Use the read lock for the get API:
{noformat}
public NodeHealthDetails getNodeHealthDetails() {
  this.writeLock.lock();

  try {
    return this.nodeHealthDetails;
  } finally {
    this.writeLock.unlock();
  }
}
{noformat}
# Fix all Jenkins errors.
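
For the read lock comment, a self-contained sketch of the pattern I have in mind. The class NodeHealthHolderSketch is hypothetical, and a plain String stands in for NodeHealthDetails so the example compiles on its own:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class NodeHealthHolderSketch {
  private final Lock readLock;
  private final Lock writeLock;
  // String is a stand-in for the NodeHealthDetails type in the patch.
  private String nodeHealthDetails = "";

  NodeHealthHolderSketch() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    this.readLock = lock.readLock();
    this.writeLock = lock.writeLock();
  }

  /** Getter takes the shared read lock, so concurrent readers do not block each other. */
  String getNodeHealthDetails() {
    readLock.lock();
    try {
      return nodeHealthDetails;
    } finally {
      readLock.unlock();
    }
  }

  /** Setter takes the exclusive write lock. */
  void setNodeHealthDetails(String details) {
    writeLock.lock();
    try {
      nodeHealthDetails = details;
    } finally {
      writeLock.unlock();
    }
  }
}
```

Taking the write lock in a getter is still correct, but it serializes all readers against each other for no benefit.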

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch, YARN-10335.002.patch, 
> YARN-10335.003.patch, YARN-10335.004.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value send from nodemanagers






[jira] [Commented] (YARN-10359) Log container report only if list is not empty

2020-08-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169243#comment-17169243
 ] 

Bibin Chundatt commented on YARN-10359:
---

+1 committing shortly

> Log container report only if list is not empty
> --
>
> Key: YARN-10359
> URL: https://issues.apache.org/jira/browse/YARN-10359
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10359.001.patch, YARN-10359.002.patch
>
>
> In NodeStatusUpdaterImpl print log only if containerReports list is  not empty
> {code:java}
> if (containerReports != null) {
> LOG.info("Registering with RM using containers :" + containerReports);
>  }
> {code}
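
The quoted snippet only guards against null, so an empty list would still be logged. A minimal sketch of the stricter check the summary asks for is below; the class ContainerReportLogSketch and its methods are invented for illustration, and System.out stands in for the real logger:

```java
import java.util.List;

final class ContainerReportLogSketch {
  /** True only when there is at least one container report to show. */
  static boolean shouldLog(List<?> containerReports) {
    return containerReports != null && !containerReports.isEmpty();
  }

  /** Prints the registration message only for a non-empty list. */
  static void logIfPresent(List<?> containerReports) {
    if (shouldLog(containerReports)) {
      System.out.println(
          "Registering with RM using containers :" + containerReports);
    }
  }
}
```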






[jira] [Commented] (YARN-10369) Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG

2020-07-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166332#comment-17166332
 ] 

Bibin Chundatt commented on YARN-10369:
---

[~Jim_Brennan],

In addition to the above comment, please also use {}-placeholders for logging.

> Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG
> --
>
> Key: YARN-10369
> URL: https://issues.apache.org/jira/browse/YARN-10369
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10369.001.patch
>
>
> This message is logged at the info level, but it doesn't really add much 
> information.
> We changed this to DEBUG internally years ago and haven't missed it.
> {noformat}
> 2020-07-27 21:51:29,027 INFO  [RM Event dispatcher] 
> security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken 
> for nodeId : localhost.localdomain:45454 for container : 
> container_1595886659189_0001_01_01
> {noformat}






[jira] [Commented] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval

2020-07-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166219#comment-17166219
 ] 

Bibin Chundatt commented on YARN-10208:
---

Missed committing this JIRA. The test case failures are not related to the 
attached patch.
Committing shortly.

> Add capacityScheduler metric for NODE_UPDATE interval
> -
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch, YARN-10208.007.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Updated] (YARN-10208) Add capacityScheduler metric for NODE_UPDATE interval

2020-07-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10208:
--
Summary: Add capacityScheduler metric for NODE_UPDATE interval  (was: 
CapacityScheduler metric for evaluating the time difference between node 
heartbeats)

> Add capacityScheduler metric for NODE_UPDATE interval
> -
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Updated] (YARN-10208) CapacityScheduler metric for evaluating the time difference between node heartbeats

2020-07-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10208:
--
Summary: CapacityScheduler metric for evaluating the time difference 
between node heartbeats  (was: Add metric in CapacityScheduler for evaluating 
the time difference between node heartbeats)

> CapacityScheduler metric for evaluating the time difference between node 
> heartbeats
> ---
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch, 
> YARN-10208.006.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163246#comment-17163246
 ] 

Bibin Chundatt commented on YARN-10315:
---

+1, looks good to me.

[~adam.antal] I will wait for a few days before committing.

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state the RMNodeResourceUpdateEvent is 
> send for every heartbeat . Which will result in scheduler resource update.
> Avoid sending the same.
>  Scheduler node resource update iterates through all the queues for resource 
> update which is costly..






[jira] [Created] (YARN-10356) Consider node labels also for centralized O scheduling

2020-07-21 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10356:
-

 Summary: Consider node labels also for centralized O scheduling
 Key: YARN-10356
 URL: https://issues.apache.org/jira/browse/YARN-10356
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bibin Chundatt









[jira] [Commented] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161114#comment-17161114
 ] 

Bibin Chundatt commented on YARN-10315:
---

Thank you [~Sushil-K-S] for the patch.

Overall, the patch looks good to me.

[~adam.antal] any comments?

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch
>
>
> When the node is in DECOMMISSIONING state the RMNodeResourceUpdateEvent is 
> send for every heartbeat . Which will result in scheduler resource update.
> Avoid sending the same.
>  Scheduler node resource update iterates through all the queues for resource 
> update which is costly..






[jira] [Comment Edited] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161114#comment-17161114
 ] 

Bibin Chundatt edited comment on YARN-10315 at 7/20/20, 10:17 AM:
--

Thank you [~Sushil-K-S] for the patch.

Overall, the patch looks good to me. Please fix the whitespace errors.

[~adam.antal] any comments?


was (Author: bibinchundatt):
Thank you [~Sushil-K-S] for the patch.

Over all patch looks good to me ..

[~adam.antal] any comments 

> Avoid sending RMNodeResoureupdate event if resource is same
> ---
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch
>
>
> When the node is in DECOMMISSIONING state the RMNodeResourceUpdateEvent is 
> send for every heartbeat . Which will result in scheduler resource update.
> Avoid sending the same.
>  Scheduler node resource update iterates through all the queues for resource 
> update which is costly..






[jira] [Comment Edited] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936
 ] 

Bibin Chundatt edited comment on YARN-10352 at 7/20/20, 6:27 AM:
-

[~prabhujoseph]

With the current approach we are iterating through all the nodes in the 
partition twice.

We could filter out the nodes during the {{reSortClusterNodes}} iteration rather 
than creating a list and then iterating over it all again. Thoughts?
 One more filter on {{preferrednodeIterator}} while querying nodes per 
schedulerKey would reduce the node selection being done during the sorting 
interval of 5 sec.

Iterators.filter(iterator, 


was (Author: bibinchundatt):
[~prabhujoseph]

With current approach we are iterating through all the nodes 2 times in the 
partition.

We could filter out the nodes during the {{reSortClusterNodes}} iteration that 
creating a list then iterating it all over it again. thoughts ?
 One more additional filter to {{preferrednodeIterator}} while querying nodes 
per schedulerKey would reduce the node selection being done during sorting 
interval of 5 sec.

Iterators.filter(iterator, 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160936#comment-17160936
 ] 

Bibin Chundatt commented on YARN-10352:
---

[~prabhujoseph]

With the current approach we iterate through all the nodes in the partition 
twice.

We could filter out the nodes during the {{reSortClusterNodes}} iteration rather 
than creating a list and then iterating over it again. Thoughts?
 Adding one more filter to {{preferrednodeIterator}} while querying nodes 
per schedulerKey would reduce the node selection done within the sorting 
interval of 5 sec.

Iterators.filter(iterator, 
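
A minimal sketch of the extra filter suggested in this comment, not the patch itself. It wraps an existing node iterator so that stale nodes are skipped lazily at query time instead of being copied into a second list. Guava's Iterators.filter(Iterator, Predicate) gives the same lazy behaviour; plain java.util is used here so the sketch stays self-contained. The Node class, the skipStale helper, and the staleness threshold are illustrative assumptions, not YARN code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class HeartbeatFilterSketch {

  /** Hypothetical stand-in for a scheduler node; not a YARN class. */
  static final class Node {
    final String name;
    final long lastHeartbeatMs;

    Node(String name, long lastHeartbeatMs) {
      this.name = name;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  /** Lazily skips elements that fail the predicate; no intermediate list is built. */
  static <T> Iterator<T> filter(final Iterator<T> source, final Predicate<? super T> keep) {
    return new Iterator<T>() {
      private T next;
      private boolean ready;

      @Override
      public boolean hasNext() {
        // Advance the source until an element passes the predicate.
        while (!ready && source.hasNext()) {
          T candidate = source.next();
          if (keep.test(candidate)) {
            next = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        T result = next;
        next = null;
        return result;
      }
    };
  }

  /** Keeps only nodes whose last heartbeat is within maxAgeMs of nowMs (assumed rule). */
  static Iterator<Node> skipStale(Iterator<Node> nodes, final long nowMs, final long maxAgeMs) {
    return filter(nodes, n -> nowMs - n.lastHeartbeatMs <= maxAgeMs);
  }

  /** Convenience for the demo: names of the nodes that survive the filter. */
  static List<String> freshNodeNames(List<Node> nodes, long nowMs, long maxAgeMs) {
    List<String> names = new ArrayList<>();
    Iterator<Node> it = skipStale(nodes.iterator(), nowMs, maxAgeMs);
    while (it.hasNext()) {
      names.add(it.next().name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("worker0", 1_000L),   // stopped, last heartbeat long ago
        new Node("worker1", 9_500L));  // heartbeated recently
    System.out.println(freshNodeNames(nodes, 10_000L, 2_000L)); // prints [worker1]
  }
}
```

Filtering inside the iterator keeps the check close to the point where a node is actually picked, so a node that stops heartbeating between two sort passes is still skipped.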

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is enabled, stopping a NM won't unregister it from the RM. So 
> the RM active nodes will still include those stopped nodes until the NM Liveliness 
> Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers to those nodes. It needs to 
> exclude the nodes which have not heartbeated for the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to the 
> stopped NM worker0.






[jira] [Updated] (YARN-10350) TestUserGroupMappingPlacementRule fails

2020-07-16 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10350:
--
Fix Version/s: 3.4.0

> TestUserGroupMappingPlacementRule fails
> ---
>
> Key: YARN-10350
> URL: https://issues.apache.org/jira/browse/YARN-10350
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10350.001.patch, YARN-10350.002.patch
>
>
> TestUserGroupMappingPlacementRule fails on trunk:
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] Tests run: 31, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 2.662 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule
> [ERROR] testResolvedQueueIsNotManaged(org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule)  Time elapsed: 0.03 s  <<< ERROR!
> java.lang.Exception: Unexpected exception, expected but was
>   at org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:28)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: java.lang.AssertionError: Queue expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.verifyQueueMapping(TestUserGroupMappingPlacementRule.java:236)
>   at org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule.testResolvedQueueIsNotManaged(TestUserGroupMappingPlacementRule.java:516)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:19)
>   ... 18 more
> {noformat}






[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153234#comment-17153234
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you [~cyrusjackson25] for working on this.

A few comments:


# Refer to NodeHealthStatus for how the records need to be implemented. Define 
them as abstract and also add comments.
# setNodeResources -> setNodeResourceScore; rename the variables too.
# I see an additional description field. Why did we add this?
 {noformat}
  optional string node_health_description = 4;
 {noformat}
# In NodeHealthService, instead of *getNodeHealthDetails* we could add 
updateNodeHealthDetails.
# Add the visibility annotation as private.
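
The record-related points in this review could look roughly like the sketch below. It is a hedged illustration only: NodeHealthDetailSketch, its accessors, and the in-memory implementation are assumptions made for this example. Real YARN records such as NodeHealthStatus are created through the framework's record factory and carry Hadoop's audience annotations, which are only mentioned in comments here to keep the example dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical abstract record, loosely following the abstract-class style of
 * existing YARN records. In Hadoop it would also carry a private visibility
 * annotation; that is only noted here, not applied.
 */
abstract class NodeHealthDetailSketch {

  /** Factory method; an in-memory implementation stands in for a protobuf-backed one. */
  static NodeHealthDetailSketch newInstance(int overallScore, Map<String, Integer> scores) {
    NodeHealthDetailSketch detail = new InMemoryDetail();
    detail.setOverallScore(overallScore);
    detail.setNodeResourceScore(scores);
    return detail;
  }

  /** Overall health score of the node. */
  abstract int getOverallScore();

  abstract void setOverallScore(int overallScore);

  /** Per-resource scores keyed by resource name, for example "ssd". */
  abstract Map<String, Integer> getNodeResourceScore();

  abstract void setNodeResourceScore(Map<String, Integer> scores);

  /** Simple implementation that copies maps defensively on the way in and out. */
  private static final class InMemoryDetail extends NodeHealthDetailSketch {
    private int overallScore;
    private Map<String, Integer> scores = new HashMap<>();

    @Override
    int getOverallScore() {
      return overallScore;
    }

    @Override
    void setOverallScore(int overallScore) {
      this.overallScore = overallScore;
    }

    @Override
    Map<String, Integer> getNodeResourceScore() {
      return new HashMap<>(scores);
    }

    @Override
    void setNodeResourceScore(Map<String, Integer> scores) {
      this.scores = new HashMap<>(scores);
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> scores = new HashMap<>();
    scores.put("ssd", 80);
    NodeHealthDetailSketch detail = newInstance(90, scores);
    System.out.println(detail.getOverallScore() + " " + detail.getNodeResourceScore());
  }
}
```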

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
> Attachments: YARN-10335.001.patch
>
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151483#comment-17151483
 ] 

Bibin Chundatt commented on YARN-10335:
---

[~subru]/[~sunilg]  Does the proto structure look good  ?

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:22 AM:


Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what I have in mind about the health detail. The node manager has a node 
health service which returns a boolean value. It sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories. 

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{noformat}
message NodeHealthStatusProto {
  optional bool isHealthy = 1;
  optional string nodeHealthDescription = 2;
  optional string exceptionString = 3;
  optional NodeHealthDetail nodehealthDetail = 4;
}

message NodeHealthDetail {
  optional int32 overallscore = 1;
  optional StringIntMapProto nodeResources = 2;
}

message StringIntMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - ssd, non ssd, etc.
{noformat}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.
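
One way to read "pluggable" here is to choose the health service implementation by class name from configuration. The sketch below is an assumption-laden illustration, not the existing NodeHealthService API: the HealthServiceSketch interface, both implementations, and the configuration key are hypothetical names introduced only for this example.

```java
import java.util.HashMap;
import java.util.Map;

class PluggableHealthSketch {

  /** Hypothetical plug-in contract: an overall score plus per-resource scores. */
  interface HealthServiceSketch {
    int overallScore();

    Map<String, Integer> resourceScores();
  }

  /** Default implementation used when nothing is configured. */
  static final class DefaultHealthService implements HealthServiceSketch {
    public DefaultHealthService() {
    }

    @Override
    public int overallScore() {
      return 100;
    }

    @Override
    public Map<String, Integer> resourceScores() {
      return new HashMap<>();
    }
  }

  /** Example custom implementation reporting scores for "ssd" and "non ssd". */
  static final class DiskAwareHealthService implements HealthServiceSketch {
    public DiskAwareHealthService() {
    }

    @Override
    public int overallScore() {
      return 70;
    }

    @Override
    public Map<String, Integer> resourceScores() {
      Map<String, Integer> scores = new HashMap<>();
      scores.put("ssd", 90);
      scores.put("non ssd", 50);
      return scores;
    }
  }

  /** Hypothetical configuration key; not a real YARN property. */
  static final String IMPL_KEY = "example.node-health.service.class";

  /** Instantiates the configured implementation, or the default if none is set. */
  static HealthServiceSketch load(Map<String, String> conf) {
    String className = conf.get(IMPL_KEY);
    if (className == null) {
      return new DefaultHealthService();
    }
    try {
      Class<?> clazz = Class.forName(className);
      return clazz.asSubclass(HealthServiceSketch.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load health service " + className, e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(load(conf).overallScore());
    conf.put(IMPL_KEY, DiskAwareHealthService.class.getName());
    System.out.println(load(conf).resourceScores());
  }
}
```

Keeping the contract small (a score plus a key/value map) mirrors the proposed proto messages, so a custom implementation only has to decide how the scores are computed.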


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what I have in mind about the health detail. The node manager has a node 
health service which returns a boolean value. It sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories. 

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be -  ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/5/20, 6:19 AM:


Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what I have in mind about the health detail. The node manager has a node 
health service which returns a boolean value. It sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories. 

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be -  ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what I have in mind about the health detail. The node manager has a node 
health service which returns a boolean value. It sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories. 

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-04 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151290#comment-17151290
 ] 

Bibin Chundatt commented on YARN-10332:
---

[~yehuanhuan] My bad.

The state transition is defined twice, so it makes sense to remove it. I 
misunderstood the JIRA as YARN-10315.
+1 for the change.
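
To illustrate why a transition registered twice for the same state and event is redundant, here is a small hypothetical transition table. It is not Hadoop's state machine factory and does not claim to reproduce its behaviour; the class, enums, and method names are assumptions for this sketch. The second registration of the same (state, event) pair adds nothing, so dropping it leaves lookups unchanged.

```java
import java.util.HashMap;
import java.util.Map;

class TransitionTableSketch {

  enum State { RUNNING, DECOMMISSIONING }

  enum Event { RESOURCE_UPDATE, GRACEFUL_DECOMMISSION }

  // Maps "STATE/EVENT" to the target state.
  private final Map<String, State> table = new HashMap<>();

  private static String key(State from, Event event) {
    return from + "/" + event;
  }

  /** Registers a transition; returns false if (from, event) is already registered. */
  boolean add(State from, Event event, State to) {
    String k = key(from, event);
    if (table.containsKey(k)) {
      return false; // duplicate registration, nothing changes
    }
    table.put(k, to);
    return true;
  }

  /** Looks up the target state; throws if no transition is registered. */
  State next(State from, Event event) {
    State to = table.get(key(from, event));
    if (to == null) {
      throw new IllegalStateException("Invalid event " + event + " at " + from);
    }
    return to;
  }

  public static void main(String[] args) {
    TransitionTableSketch t = new TransitionTableSketch();
    System.out.println(t.add(State.DECOMMISSIONING, Event.RESOURCE_UPDATE, State.DECOMMISSIONING));
    System.out.println(t.add(State.DECOMMISSIONING, Event.RESOURCE_UPDATE, State.DECOMMISSIONING));
    System.out.println(t.next(State.DECOMMISSIONING, Event.RESOURCE_UPDATE));
  }
}
```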


> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Comment Edited] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt edited comment on YARN-10335 at 7/2/20, 4:45 AM:


Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding what I have in mind about the health detail. The node manager has a node 
health service which returns a boolean value. It sends UNHEALTHY if the node 
health script returns an error or if we don't have any healthy local directories. 

We will introduce a field (or fields) that returns a detailed node health value 
for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.


was (Author: bibinchundatt):
Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding my thoughts about the health value. The node manager has a node health 
service which returns a boolean value. 
It sends UNHEALTHY if the node health script returns an error or if we don't 
have any healthy local directories. 

We want to introduce a field (or fields) that returns a detailed node health 
value for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Commented] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149843#comment-17149843
 ] 

Bibin Chundatt commented on YARN-10335:
---

Thank you for showing interest in the JIRA [~cyrusjackson25]

Adding my thoughts about the health value. The node manager has a node health 
service which returns a boolean value. 
It sends UNHEALTHY if the node health script returns an error or if we don't 
have any healthy local directories. 

We want to introduce a field (or fields) that returns a detailed node health 
value for the node along with the NodeHealthStatus.

Example:
{quote}
message NodeHealthStatusProto {
optional bool isHealthy = 1;
optional string nodeHealthDescription = 2;
optional string exceptionString = 3;
optional NodeHealthDetail nodehealthDetail=4;
optional StringIntMapProto nodeHealthdetail=5;
}

message StringStringMapProto {
  optional string key = 1;
  optional int32 value = 2;
}

keys could be - overall , ssd, non ssd, etc.. 
{quote}

Also make the NodeHealthService pluggable to support custom implementations of 
NodeHealthServices.

> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Cyrus Jackson
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10335:
--
Description: 
YARN-7494 supports providing interface to choose nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on node 
health value sent from nodemanagers.

  was:
YARN-7494 supports providing interface to choose nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
nodehealth value


> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on node 
> health value sent from nodemanagers.






[jira] [Updated] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10335:
--
Description: 
YARN-7494 supports providing interface to choose nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
nodehealth value

  was:
YARN-7494 supports providing interface to choose nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
nodehealth.


> Improve scheduling of containers based on node health
> -
>
> Key: YARN-10335
> URL: https://issues.apache.org/jira/browse/YARN-10335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Major
>
> YARN-7494 supports providing interface to choose nodeset for scheduler 
> allocation.
> We could leverage the same to support allocation of containers based on 
> nodehealth value






[jira] [Created] (YARN-10335) Improve scheduling of containers based on node health

2020-07-01 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10335:
-

 Summary: Improve scheduling of containers based on node health
 Key: YARN-10335
 URL: https://issues.apache.org/jira/browse/YARN-10335
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt


YARN-7494 supports providing interface to choose nodeset for scheduler 
allocation.
We could leverage the same to support allocation of containers based on 
nodehealth.






[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM:


[~yehuanhuan] This looks like a duplicate of YARN-10315. 

The current change will cause an InvalidStateTransitionException when the node is 
in the DECOMMISSIONING state and an admin calls node resource update. The same 
applies during node update.



was (Author: bibinchundatt):
[~yehuanhuan] This looks like a duplicate of YARN-10315. 

The current change will cause an InvalidStateTransitionException when the node is 
in the DECOMMISSIONING state and an admin calls node resource update. The same 
applies during node update.


> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Comment Edited] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt edited comment on YARN-10332 at 7/1/20, 6:56 AM:


[~yehuanhuan] This looks like a duplicate of YARN-10315. 

The current change will cause an InvalidStateTransitionException when the node is 
in the DECOMMISSIONING state and an admin calls node resource update. The same 
applies during node update.



was (Author: bibinchundatt):
[~yehuanhuan] looks like duplicate of YARN-10315. 

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-01 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149166#comment-17149166
 ] 

Bibin Chundatt commented on YARN-10332:
---

[~yehuanhuan] looks like duplicate of YARN-10315. 

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.






[jira] [Created] (YARN-10315) Avoid sending RMNodeResoureupdate event if resource is same

2020-06-14 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10315:
-

 Summary: Avoid sending RMNodeResoureupdate event if resource is 
same
 Key: YARN-10315
 URL: https://issues.apache.org/jira/browse/YARN-10315
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt


When the node is in the DECOMMISSIONING state, an RMNodeResourceUpdateEvent is 
sent for every heartbeat, which results in a scheduler resource update.

Avoid sending the event when the resource is unchanged.

 The scheduler node resource update iterates through all the queues, which is 
costly.
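
A minimal sketch of the idea, under stated assumptions: compare the reported resource with the last known one and only dispatch an update when they differ. The Resource and NodeState classes below are hypothetical stand-ins written for this example, not the RM classes, and the "dispatch" is just a counter.

```java
import java.util.Objects;

class ResourceUpdateSketch {

  /** Hypothetical value type for a node's total resource. */
  static final class Resource {
    final long memoryMb;
    final int vcores;

    Resource(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof Resource)) {
        return false;
      }
      Resource that = (Resource) other;
      return memoryMb == that.memoryMb && vcores == that.vcores;
    }

    @Override
    public int hashCode() {
      return Objects.hash(memoryMb, vcores);
    }
  }

  /** Tracks the last known resource and counts the updates that were dispatched. */
  static final class NodeState {
    private Resource current;
    private int dispatchedUpdates;

    NodeState(Resource initial) {
      this.current = initial;
    }

    /** Called per heartbeat; dispatches only when the resource actually changed. */
    boolean onHeartbeat(Resource reported) {
      if (Objects.equals(current, reported)) {
        return false; // same resource, skip the costly scheduler update
      }
      current = reported;
      dispatchedUpdates++;
      return true;
    }

    int dispatchedUpdates() {
      return dispatchedUpdates;
    }
  }

  public static void main(String[] args) {
    NodeState node = new NodeState(new Resource(8192, 8));
    System.out.println(node.onHeartbeat(new Resource(8192, 8))); // unchanged
    System.out.println(node.onHeartbeat(new Resource(4096, 8))); // changed
    System.out.println(node.dispatchedUpdates());
  }
}
```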






[jira] [Reopened] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10307:
---

Reopening to set the correct resolution

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
> Fix For: 3.1.2
>
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK: 
> No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at 

[jira] [Resolved] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-07 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10307.
---
Fix Version/s: (was: 3.1.2)
   Resolution: Invalid

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)

[jira] [Comment Edited] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380
 ] 

Bibin Chundatt edited comment on YARN-10259 at 5/7/20, 6:00 AM:


In addition to the above: 

I think the issue is in *LeafQueue#allocateFromReservedContainer*. We attempt 
the container allocation on the first node we get while iterating through the 
whole candidate set.
Change this back to the previous logic. 

Issue: the container gets unreserved on node1, then we reserve on node1 again 
during allocation. Nodes with reserved containers that come last in the list 
might never get a chance to do allocation/unreservation.

This impacts the performance of multi-node lookup too. *AsyncSchedulerThread* 
gives a fair chance to all nodes to unreserve/allocate for a reserved container.
If a reserved container exists, attempt allocation with a single-node candidate set.



was (Author: bibinchundatt):
In addition to the above.. 

I think the issue exists in the *LeafQueue#allocateFromReservedContainer* .. We 
do try the container allocation from first node we get iterating through all 
the candidate set.
Change to previous logic. 

 Issue -. container gets unreserved on node1. then again we reserve on node 1 
during allocation .. The nodes in the last in list with reserved containers  
might never get a chance to do allocation./ unreservation.

This impacts performance of multiNodelookup too. AsyncSchedulerThread give a 
fair change to each node to do unreserve/allocate from reserved container.
Attempt allocation if reserved container exists with a single candidate nodeset.


> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.






[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101380#comment-17101380
 ] 

Bibin Chundatt commented on YARN-10259:
---

In addition to the above: 

I think the issue is in *LeafQueue#allocateFromReservedContainer*. We attempt 
the container allocation on the first node we get while iterating through the 
whole candidate set.
Change this back to the previous logic. 

Issue: the container gets unreserved on node1, then we reserve on node1 again 
during allocation. Nodes with reserved containers that come last in the list 
might never get a chance to do allocation/unreservation.

This impacts the performance of multi-node lookup too. AsyncSchedulerThread 
gives a fair chance to each node to unreserve/allocate from a reserved container.
If a reserved container exists, attempt allocation with a single-node candidate set.


> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.






[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100769#comment-17100769
 ] 

Bibin Chundatt commented on YARN-10259:
---

[~prabhujoseph]

I think we have a few issues in RegularContainerAllocator#allocate:

 #  Only when the preCheckForNodeCandidateSet check fails for 
*appInfo.precheckNode* should we continue the iteration over the next set of 
nodes.
 #  If preCheckForNodeCandidateSet returns null, try allocation.
 #  In all other cases, return the result of preCheckForNodeCandidateSet(..).
 #  If we have a reserved container and the pending ask for the scheduler key 
is zero, unreserve the container.
{code}
if (application.getOutstandingAsksCount(schedulerKey) == 0) {
  // Release
  return new ContainerAllocation(reservedContainer, null,
      AllocationState.QUEUE_SKIPPED);
}
{code}
 #  For *schedulingPS.getPreferredNodeIterator*, I think we should 
filter out all the nodes with reserved containers. This should reduce the 
number of reservations.
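
The last point, skipping nodes that already hold a reservation when building the candidate list, might look roughly like the sketch below. NodeView and CandidateFilter are hypothetical stand-ins invented for illustration; they are not the scheduler's real node or placement-set classes, and the actual iterator in the CapacityScheduler may need to handle more state than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of a scheduler node (assumption, not a YARN class).
final class NodeView {
  final String name;
  final boolean hasReservedContainer;

  NodeView(String name, boolean hasReservedContainer) {
    this.name = name;
    this.hasReservedContainer = hasReservedContainer;
  }
}

// Hypothetical helper showing the filtering idea only.
final class CandidateFilter {
  /** Keeps the candidate order but drops nodes that already hold a reservation. */
  static List<NodeView> withoutReserved(List<NodeView> candidates) {
    List<NodeView> result = new ArrayList<>();
    for (NodeView node : candidates) {
      if (!node.hasReservedContainer) {
        result.add(node);
      }
    }
    return result;
  }
}
```

With candidates h1 (reserved) and h2 (not reserved), only h2 would remain in the list handed to the allocator.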
   
 


> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: REPRO_TEST.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.






[jira] [Comment Edited] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper

2020-05-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735
 ] 

Bibin Chundatt edited comment on YARN-10246 at 5/5/20, 9:48 AM:


[~dmmkr] 

In a non-secure cluster this could work with different property names. 
But I am not sure how this could work in a secure cluster.
Does Curator support multiple Kerberos configurations in the same process (the 
RM is the process here)? The RM has to connect to both the Federation State 
Store and the RM State Store.

IIRC the version of Curator doesn't support that. 


was (Author: bibinchundatt):
[~dmmkr] 

In case of non secure cluster this could work with a different configuration 
file.. But i am not sure how this could work in secure cluster.
Does curator support multiple kerboros configuration  in same process (RM is 
the process here.) RM has to connect to Federation state store and also RM 
State Store..

IIRC the version of curator doesnt support the same. 

> Enable YARN Router to have a dedicated Zookeeper
> 
>
> Key: YARN-10246
> URL: https://issues.apache.org/jira/browse/YARN-10246
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation, router
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10246.001.patch, YARN-10246.002.patch
>
>
> Currently, we have a single parameter, hadoop.zk.address, for both the Router 
> and the ResourceManager. Because of this, FederationStateStore and 
> RMStateStore must be on the same ZooKeeper instance. 
> With this topology the load on ZooKeeper can be high, since all 
> subcluster RMs write to a single ZooKeeper.
> So, if we introduce a new configuration such as hadoop.federation.zk.address, 
> we can have FederationStateStore on a dedicated ZooKeeper.
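
For the non-secure case, the proposal in the description could be sketched as a simple lookup. This is only an illustration: ZkAddressResolver is a hypothetical class, java.util.Properties stands in for Hadoop's Configuration, hadoop.federation.zk.address is the key proposed in the issue, and falling back to hadoop.zk.address when the new key is unset is an assumption, not something the issue or the attached patches confirm.

```java
import java.util.Properties;

// Hypothetical helper (assumption); not part of Hadoop.
final class ZkAddressResolver {
  static final String SHARED_ZK_ADDRESS = "hadoop.zk.address";
  // Key name as proposed in the issue description.
  static final String FEDERATION_ZK_ADDRESS = "hadoop.federation.zk.address";

  /**
   * Returns the dedicated federation ZooKeeper address if configured,
   * otherwise the shared address (assumed fallback behaviour).
   */
  static String federationZkAddress(Properties conf) {
    String dedicated = conf.getProperty(FEDERATION_ZK_ADDRESS);
    if (dedicated != null && !dedicated.trim().isEmpty()) {
      return dedicated.trim();
    }
    return conf.getProperty(SHARED_ZK_ADDRESS);
  }
}
```

This does not address the secure-cluster question raised in the comment, since it only selects an address and says nothing about per-connection Kerberos configuration.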






[jira] [Commented] (YARN-10246) Enable YARN Router to have a dedicated Zookeeper

2020-05-05 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099735#comment-17099735
 ] 

Bibin Chundatt commented on YARN-10246:
---

[~dmmkr] 

In a non-secure cluster this could work with a different configuration 
file. But I am not sure how this could work in a secure cluster.
Does Curator support multiple Kerberos configurations in the same process (the 
RM is the process here)? The RM has to connect to both the Federation State 
Store and the RM State Store.

IIRC the version of Curator doesn't support that. 

> Enable YARN Router to have a dedicated Zookeeper
> 
>
> Key: YARN-10246
> URL: https://issues.apache.org/jira/browse/YARN-10246
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation, router
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10246.001.patch, YARN-10246.002.patch
>
>
> Currently, we have a single parameter, hadoop.zk.address, for both the Router 
> and the ResourceManager. Because of this, FederationStateStore and 
> RMStateStore must be on the same ZooKeeper instance. 
> With this topology the load on ZooKeeper can be high, since all 
> subcluster RMs write to a single ZooKeeper.
> So, if we introduce a new configuration such as hadoop.federation.zk.address, 
> we can have FederationStateStore on a dedicated ZooKeeper.






[jira] [Commented] (YARN-10229) [Federation] Client should be able to submit application to RM directly using normal client conf

2020-05-03 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098655#comment-17098655
 ] 

Bibin Chundatt commented on YARN-10229:
---

[~BilwaST] / [~122512...@qq.com]

NodeManagers need to stay independent of the applications. Parsing 
application-specific details on the NodeManager side is not advisable.

Alternate solution:

Currently AMRMProxyService always overrides the AMRMToken. If the interceptors 
could signal whether the AMRMToken needs to be overridden, then direct 
submission should work. In this case the FederationInterceptor could check 
whether the home application entry is available in the federation state store. 

Thoughts?
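
A rough sketch of the alternate solution, assuming the interceptor is asked whether the token should be overridden. TokenOverridePolicy and FederationTokenPolicy are hypothetical names made up for this example; they are not the real AMRMProxyService or FederationInterceptor APIs, and the in-memory set merely stands in for a federation state store lookup.

```java
import java.util.Set;

// Hypothetical callback an interceptor could expose (assumption, not a YARN API).
interface TokenOverridePolicy {
  /** Returns true if the proxy should replace the AMRMToken for this application. */
  boolean shouldOverrideToken(String applicationId);
}

// Stand-in for a federation-aware interceptor: override the token only when the
// application has a home entry, i.e. it was submitted through the Router.
final class FederationTokenPolicy implements TokenOverridePolicy {
  // Stand-in for querying the federation state store.
  private final Set<String> homeApplications;

  FederationTokenPolicy(Set<String> homeApplications) {
    this.homeApplications = homeApplications;
  }

  @Override
  public boolean shouldOverrideToken(String applicationId) {
    return homeApplications.contains(applicationId);
  }
}
```

An application submitted directly to the RM would have no home entry, so its token would be left untouched, and the NodeManager would not need to parse job.xml.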

> [Federation] Client should be able to submit application to RM directly using 
> normal client conf
> 
>
> Key: YARN-10229
> URL: https://issues.apache.org/jira/browse/YARN-10229
> Project: Hadoop YARN
>  Issue Type: Wish
>  Components: amrmproxy, federation
>Affects Versions: 3.1.1
>Reporter: JohnsonGuo
>Assignee: Bilwa S T
>Priority: Major
>
> Scenario: When the YARN federation feature is enabled with multiple YARN 
> clusters, one can submit jobs to the yarn-router by *modifying* the client 
> configuration to use the YARN Router address.
> But if one still wants to submit jobs via the original client (from before 
> federation was enabled) to the RM directly, it will encounter the AMRMToken 
> exception. 
>  That means once federation is enabled, anyone who wants to submit a job has 
> to modify the client conf.
>  
> One possible solution for this scenario is:
> In the NodeManager, when the client ApplicationMaster request comes:
>  * get the client job.xml  from HDFS "".
>  * parse the "yarn.resourcemanager.scheduler.address" parameter in job.xml
>  * if the value of the parameter is "localhost:8049" (AMRM address), then do 
> the AMRMToken validation process
>  * if the value of the parameter is "rm:port" (RM address), then skip the 
> AMRMToken validation process
>  
>  
>  






[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-04-06 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076865#comment-17076865
 ] 

Bibin Chundatt commented on YARN-10208:
---

Thank you [~adam.antal] for additional review. Will wait for a day before 
commit.

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.
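
As an illustration of what such a metric computes, the sketch below keeps a running mean of the gap between successive node-update events. HeartbeatIntervalTracker is a hypothetical class written for this example; it is not taken from the attached patches, and the real metric is presumably wired into the scheduler's metrics system rather than held in a plain object.

```java
// Hypothetical helper (assumption): tracks the mean gap between node updates.
final class HeartbeatIntervalTracker {
  private long lastUpdateMillis = -1; // -1 means no update has been seen yet
  private long intervalCount = 0;
  private long intervalSumMillis = 0;

  /** Records a node update observed at the given timestamp in milliseconds. */
  synchronized void onNodeUpdate(long nowMillis) {
    if (lastUpdateMillis >= 0) {
      intervalSumMillis += nowMillis - lastUpdateMillis;
      intervalCount++;
    }
    lastUpdateMillis = nowMillis;
  }

  /** Average interval in milliseconds, or 0 if fewer than two updates were seen. */
  synchronized double averageIntervalMillis() {
    return intervalCount == 0 ? 0.0 : (double) intervalSumMillis / intervalCount;
  }
}
```

Updates at 1000 ms, 1100 ms and 1400 ms give intervals of 100 ms and 300 ms, so the reported average would be 200 ms.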






[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071476#comment-17071476
 ] 

Bibin Chundatt commented on YARN-10208:
---

+1 looks good to me. 

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070748#comment-17070748
 ] 

Bibin Chundatt commented on YARN-10208:
---

[~lapjarn]

Minor nit:

Rename the schedulerHeartBeatIntervalAverage variable and method to 
schedulerNodeHBInterval.

> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Assigned] (YARN-9627) DelegationTokenRenewer could block transitionToStandy

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reassigned YARN-9627:


Assignee: (was: Bibin Chundatt)

> DelegationTokenRenewer could block transitionToStandy
> -
>
> Key: YARN-9627
> URL: https://issues.apache.org/jira/browse/YARN-9627
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: krishna reddy
>Priority: Critical
> Attachments: YARN-9627.001.patch, YARN-9627.002.patch, 
> YARN-9627.003.patch
>
>
> Cluster size: 5K
> Running containers: 55K
> *Scenario*: Large number of pending applications (around 50K) while performing 
> an RM switchover
> The exception below occurs:
> {noformat}
> 2019-06-13 17:39:27,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renew Kind: HDFS_DELEGATION_TOKEN, Service: X:1616, Ident: (token 
> for root: HDFS_DELEGATION_TOKEN owner=root/had...@hadoop.com, renewer=yarn, 
> realUser=, issueDate=1560361265181, maxDate=1560966065181, 
> sequenceNumber=104708, masterKeyId=3);exp=1560533965360; 
> apps=[application_1560346941775_20702] in 86397766 ms, appId = 
> [application_1560346941775_20702]
> 2019-06-13 17:39:27,609 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:522)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> 2019-06-13 17:58:20,878 ERROR org.apache.zookeeper.ClientCnxn: Time out error 
> occurred for the packet 'clientPath:null serverPath:null finished:false 
> header:: 27,4  replyHeader:: 27,4295687588,0  request:: 
> '/rmstore1/ZKRMStateRoot/RMDTSecretManagerRoot/RMDTMasterKeysRoot/DelegationKey_49,F
>   response:: 
> #31ff8a16b74ffe129768ffdbffe949ff8dffd517ffcafffa,s{4295423577,4295423577,1560342837789,1560342837789,0,0,0,0,17,0,4295423577}
>  '.
> 2019-06-13 17:58:20,877 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 
> X:1616, Ident: (token for root: HDFS_DELEGATION_TOKEN 
> owner=root/had...@hadoop.com, renewer=yarn, realUser=, 
> issueDate=1560366110990, maxDate=1560970910990, sequenceNumber=111891, 
> masterKeyId=3);exp=1560534896413; apps=[application_1560346941775_28115]]
> 2019-06-13 17:58:20,924 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer on recovery.
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:397)
> at java.util.Timer.schedule(Timer.java:208)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:612)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:523)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Commented] (YARN-10172) Default ApplicationPlacementType class should be configurable

2020-03-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068619#comment-17068619
 ] 

Bibin Chundatt commented on YARN-10172:
---

[~cyrusjackson25]

Please check the checkstyle issues. Apart from that, the changes look good to me.

{code}
String DEFAULT_APPLICATION_PLACEMENT_TYPE_CLASS =
    "org.apache.hadoop.yarn."
    + "server.resourcemanager.scheduler.capacity."
    + "yarnpp.YarnppLocalityAppPlacementAllocator";
{code}
# Rename YarnppLocalityAppPlacementAllocator -> 
DummyLocalityAppPlacementAllocator
# The package name could also be shorter.


[~sunil.gov...@gmail.com] Would you like to take a look?

> Default ApplicationPlacementType class should be configurable
> -
>
> Key: YARN-10172
> URL: https://issues.apache.org/jira/browse/YARN-10172
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Cyrus Jackson
>Assignee: Cyrus Jackson
>Priority: Minor
> Attachments: YARN-10172.001.patch
>
>
> This can be useful in scheduling apps based on the configured placement type 
> class rather than resorting to LocalityAppPlacementAllocator
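
As an illustration of what "configurable default class" can mean, here is a minimal sketch that resolves an implementation class from configuration and falls back to a built-in default. It does not use the Hadoop Configuration API; the property key, the interface, and both implementation classes are hypothetical stand-ins invented for this example.

```java
import java.util.Properties;

final class ConfigurablePlacementSketch {

  // Hypothetical stand-in for the placement allocator abstraction.
  interface PlacementAllocator {
    String name();
  }

  // Hypothetical built-in default implementation.
  static final class LocalityPlacementAllocator implements PlacementAllocator {
    public String name() {
      return "locality";
    }
  }

  // Hypothetical alternative implementation selected through configuration.
  static final class CustomPlacementAllocator implements PlacementAllocator {
    public String name() {
      return "custom";
    }
  }

  // Hypothetical property key; not a real YARN configuration name.
  static final String PLACEMENT_CLASS_KEY = "example.placement.default.class";

  static final String DEFAULT_PLACEMENT_CLASS =
      LocalityPlacementAllocator.class.getName();

  // Instantiate the configured class, or the default when the key is unset.
  static PlacementAllocator create(Properties conf) {
    String className = conf.getProperty(PLACEMENT_CLASS_KEY, DEFAULT_PLACEMENT_CLASS);
    try {
      return Class.forName(className)
          .asSubclass(PlacementAllocator.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + className, e);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(create(conf).name());
    conf.setProperty(PLACEMENT_CLASS_KEY, CustomPlacementAllocator.class.getName());
    System.out.println(create(conf).name());
  }
}
```

The fallback keeps existing deployments unchanged when the key is absent, which is the usual reason for making a default configurable rather than replacing it.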






[jira] [Reopened] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-10182:
---

> SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab (the keytab of yarn)
> When I run slsrun.sh, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How can I resolve it?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab (the keytab of yarn)
> When I run slsrun.sh on RM1, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How can I resolve it?






[jira] [Resolved] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM

2020-03-27 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10182.
---
Resolution: Not A Problem

> SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab (the keytab of yarn)
> When I run slsrun.sh, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How can I resolve it?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1: active, RM2: standby
> Kerberos is on
> yarn-site.xml: /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab (the keytab of yarn)
> When I run slsrun.sh on RM1, I get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml, I get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How can I resolve it?






[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-03-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068603#comment-17068603
 ] 

Bibin Chundatt commented on YARN-10208:
---

[~lapjarn]
Minor comment.

{code}
// Add metrics for evaluating the time difference between heartbeats.
SchedulerNode node =
    nodeTracker.getNode(nodeUpdatedEvent.getRMNode().getNodeID());
if (node != null) {
  long lastInterval =
      Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
  CapacitySchedulerMetrics.getMetrics()
      .addSchedulerHeartBeatIntervalAverage(lastInterval);
}
{code}
Refactor this into a method and update the metric before the node update call.
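
A minimal sketch of the suggested refactoring, under the assumption that the interval must be computed before the node update overwrites the last heartbeat time. The Node and Metrics classes below are simplified, hypothetical stand-ins, not the real SchedulerNode or CapacitySchedulerMetrics, and time is passed in explicitly instead of calling Time.monotonicNow().

```java
final class HeartbeatIntervalSketch {

  // Hypothetical stand-in for SchedulerNode; only the field used here is modelled.
  static final class Node {
    long lastHeartbeatMonotonicTime;
  }

  // Hypothetical stand-in for the scheduler metrics; keeps a running average.
  static final class Metrics {
    private long samples;
    private long totalMs;

    void addHeartbeatInterval(long intervalMs) {
      samples++;
      totalMs += intervalMs;
    }

    double averageIntervalMs() {
      return samples == 0 ? 0.0 : (double) totalMs / samples;
    }
  }

  // The extracted method: records the gap since the node's previous heartbeat.
  static void updateHeartbeatIntervalMetric(Node node, long nowMs, Metrics metrics) {
    if (node != null) {
      metrics.addHeartbeatInterval(nowMs - node.lastHeartbeatMonotonicTime);
    }
  }

  // Simplified node update: the metric is recorded first, then the node state changes.
  static void nodeUpdate(Node node, long nowMs, Metrics metrics) {
    updateHeartbeatIntervalMetric(node, nowMs, metrics);
    if (node != null) {
      node.lastHeartbeatMonotonicTime = nowMs;
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    Metrics metrics = new Metrics();
    node.lastHeartbeatMonotonicTime = 1_000L;
    nodeUpdate(node, 2_000L, metrics); // interval of 1000 ms
    nodeUpdate(node, 5_000L, metrics); // interval of 3000 ms
    System.out.println(metrics.averageIntervalMs()); // average of the two intervals
  }
}
```

If the metric were recorded after the update, the interval would be measured against the new timestamp and would always be close to zero, which is why the ordering matters in this sketch.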


> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch
>
>
> Metric measuring average time interval between node heartbeats in capacity 
> scheduler on node update event.






[jira] [Commented] (YARN-10181) Managing Centralized Node Attribute via RMWebServices.

2020-03-17 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060849#comment-17060849
 ] 

Bibin Chundatt commented on YARN-10181:
---

Could you move this JIRA under YARN-8766?

> Managing Centralized Node Attribute via RMWebServices.
> --
>
> Key: YARN-10181
> URL: https://issues.apache.org/jira/browse/YARN-10181
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodeattibute
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> Currently, centralized NodeAttributes can be managed only through the YARN 
> NodeAttribute CLI. This issue is to support managing them via RMWebServices.
> {code}
> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeAttributes.html#Centralised_Node_Attributes_mapping.
> Centralised : Node to attributes mapping can be done through RM exposed CLI 
> or RPC (REST is yet to be supported).
> {code}






[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy

2020-03-02 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049835#comment-17049835
 ] 

Bibin Chundatt commented on YARN-6924:
--

[~youchen]

Overall, the patch looks good to me. I will wait a day before committing it.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, 
> YARN-6924.04.patch, YARN-6924.05.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy






[jira] [Comment Edited] (YARN-6924) Metrics for Federation AMRMProxy

2020-02-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785
 ] 

Bibin Chundatt edited comment on YARN-6924 at 2/27/20 4:33 PM:
---

[~youchen]

Overall, the patch looks good.

Minor nits:

* The annotation and the method signature should be on separate lines.
* The same applies to the variables in AMRMProxyMetrics.
* Since the test cases are in the same package, the visibility of the get methods could 
be package-private.
* Correct the Apache source file copyright headers too.


was (Author: bibinchundatt):
[~youchen]

Overall, the patch looks good.

Minor nits:

* The annotation and the method signature should be on separate lines.
* The same applies to the variables in AMRMProxyMetrics.
* Since the test cases are in the same package, the visibility of the get methods could 
be package-private.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy






[jira] [Commented] (YARN-6924) Metrics for Federation AMRMProxy

2020-02-27 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046785#comment-17046785
 ] 

Bibin Chundatt commented on YARN-6924:
--

[~youchen]

Overall, the patch looks good.

Minor nits:

* The annotation and the method signature should be on separate lines.
* The same applies to the variables in AMRMProxyMetrics.
* Since the test cases are in the same package, the visibility of the get methods could 
be package-private.

> Metrics for Federation AMRMProxy
> 
>
> Key: YARN-6924
> URL: https://issues.apache.org/jira/browse/YARN-6924
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-6924.01.patch, YARN-6924.01.patch, 
> YARN-6924.02.patch, YARN-6924.02.patch, YARN-6924.03.patch, YARN-6924.04.patch
>
>
> This JIRA proposes addition of metrics for Federation AMRMProxy






[jira] [Commented] (YARN-10110) In Federation Secure cluster Application submission fails when authorization is enabled

2020-02-18 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039688#comment-17039688
 ] 

Bibin Chundatt commented on YARN-10110:
---

[~BilwaST] Could you point me to the JIRA that adds Federation security support?

Also, it would be better to group all the security-related Federation issues under one 
subtask and link it to YARN-5597.

> In Federation Secure cluster Application submission fails when authorization 
> is enabled
> ---
>
> Key: YARN-10110
> URL: https://issues.apache.org/jira/browse/YARN-10110
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sushanta Sen
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-10110.001.patch, YARN-10110.002.patch
>
>
> 【Precondition】:
> 1. Secure Federated cluster is available
> 2. Add the below configuration in Router and client core-site.xml
> hadoop.security.authorization= true 
> 3. Restart the router service
> 【Test step】:
> 1. Go to router client bin path and submit a MR PI job
> 2. Observe the client console screen
> 【Expect Output】:
> No error should be thrown and Job should be successful
> 【Actual Output】:
> Job failed prompting "Protocol interface 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB is not known.,"
> 【Additional Note】:
>  But on setting the parameter to false, the job is submitted and succeeds.






[jira] [Resolved] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator

2020-01-23 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-10098.
---
Resolution: Invalid

> Add interface to get node iterators by scheduler key for AppPlacementAllocator
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>







[jira] [Updated] (YARN-10098) Add interface to get node iterators by scheduler key for AppPlacementAllocator

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10098:
--
Summary: Add interface to get node iterators by scheduler key for 
AppPlacementAllocator  (was:  AppPlacementAllocator getPreferredNodeIterator 
based on scheduler key)

> Add interface to get node iterators by scheduler key for AppPlacementAllocator
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>







[jira] [Updated] (YARN-10098) AppPlacementAllocator getPreferredNodeIterator based on scheduler key

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-10098:
--
Summary:  AppPlacementAllocator getPreferredNodeIterator based on scheduler 
key  (was:  AppPlacementAllocator get getPreferredNodeIterator based on 
scheduler key)

>  AppPlacementAllocator getPreferredNodeIterator based on scheduler key
> --
>
> Key: YARN-10098
> URL: https://issues.apache.org/jira/browse/YARN-10098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Priority: Minor
>







[jira] [Created] (YARN-10098) AppPlacementAllocator get getPreferredNodeIterator based on scheduler key

2020-01-22 Thread Bibin Chundatt (Jira)
Bibin Chundatt created YARN-10098:
-

 Summary:  AppPlacementAllocator get getPreferredNodeIterator based 
on scheduler key
 Key: YARN-10098
 URL: https://issues.apache.org/jira/browse/YARN-10098
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin Chundatt









[jira] [Assigned] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-01-22 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reassigned YARN-4575:


Assignee: (was: Bibin Chundatt)

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch
>
>
> The reserved resource reported in ApplicationResourceUsageReport covers only the default 
> partition; it should cover all partitions.






[jira] [Commented] (YARN-9624) Use switch case for ProtoUtils#convertFromProtoFormat containerState

2020-01-07 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009536#comment-17009536
 ] 

Bibin Chundatt commented on YARN-9624:
--

[~BilwaST] Could you update the patch?

> Use switch case for ProtoUtils#convertFromProtoFormat containerState
> 
>
> Key: YARN-9624
> URL: https://issues.apache.org/jira/browse/YARN-9624
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Major
>  Labels: performance
> Attachments: YARN-9624.001.patch, YARN-9624.002.patch
>
>
> On a large cluster with 100K+ containers, calling 
> {{ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""))}} on every 
> heartbeat will be too costly. Replace it with a switch case.
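
The difference between the two conversion styles can be sketched as follows. The two enums and the prefix value are small stand-ins defined for this example, assumed to mirror the shape of the protobuf and YARN enums; they are not the real generated classes.

```java
final class ContainerStateConversionSketch {

  // Stand-in for the protobuf-side enum (assumed shape, prefixed names).
  enum ContainerStateProto { C_NEW, C_RUNNING, C_COMPLETE }

  // Stand-in for the YARN-side enum.
  enum ContainerState { NEW, RUNNING, COMPLETE }

  // Assumed prefix value for this sketch.
  private static final String CONTAINER_STATE_PREFIX = "C_";

  // String-based conversion: builds a new String and does a name lookup per call.
  static ContainerState convertViaName(ContainerStateProto e) {
    return ContainerState.valueOf(e.name().replace(CONTAINER_STATE_PREFIX, ""));
  }

  // Switch-based conversion: no String allocation and no name lookup.
  static ContainerState convertViaSwitch(ContainerStateProto e) {
    switch (e) {
      case C_NEW:
        return ContainerState.NEW;
      case C_RUNNING:
        return ContainerState.RUNNING;
      case C_COMPLETE:
        return ContainerState.COMPLETE;
      default:
        throw new IllegalArgumentException("Unknown container state: " + e);
    }
  }

  public static void main(String[] args) {
    // Both conversions must agree for every constant.
    for (ContainerStateProto p : ContainerStateProto.values()) {
      if (convertViaName(p) != convertViaSwitch(p)) {
        throw new AssertionError("Mismatch for " + p);
      }
    }
    System.out.println("conversions agree");
  }
}
```

The trade-off is that the switch must be kept in sync by hand when a new state is added, whereas the string-based version adapts automatically as long as the naming convention holds.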






[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-11 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972097#comment-16972097
 ] 

Bibin Chundatt commented on YARN-9697:
--

Thank you [~abmodi]

Overall, the patch looks good to me.

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.009.patch, YARN-9697.ut.patch, YARN-9697.ut2.patch, 
> YARN-9697.wip1.patch, YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 






[jira] [Comment Edited] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-10 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342
 ] 

Bibin Chundatt edited comment on YARN-9697 at 11/11/19 7:03 AM:


[~abmodi]

A few minor nits:

# NodeQueueLoadMonitor: the following initialization is not required, since the value is 
already set in the constructor.
{code}
  private int numNodesForAnyAllocation =
      DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
Setting it to zero should be fine.
# EnrichedResourceRequest: rename the methods, since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can 
you maintain a metric to avoid iterating through the allocations for each 
scheduler key?
{noformat}
for (List allocs : allocations.values()) {
  totalAllocated += allocs.size();
}
{noformat}
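
One way to read the last suggestion is to keep a running count that is updated whenever an allocation is recorded, so the total is available without walking the map. The sketch below assumes that reading; the class, its String keys, and its String container ids are hypothetical simplifications, not the allocator's real types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AllocationCounterSketch {

  // Allocations grouped by a key; Strings stand in for the real key and container types.
  private final Map<String, List<String>> allocations = new HashMap<>();

  // Running total maintained on every insert, so no iteration is needed to read it.
  private int totalAllocated;

  void addAllocation(String key, String containerId) {
    allocations.computeIfAbsent(key, k -> new ArrayList<>()).add(containerId);
    totalAllocated++;
  }

  // Constant-time read of the total.
  int getTotalAllocated() {
    return totalAllocated;
  }

  // The loop from the review comment, kept only to cross-check the counter.
  int recountByIteration() {
    int total = 0;
    for (List<String> allocs : allocations.values()) {
      total += allocs.size();
    }
    return total;
  }

  public static void main(String[] args) {
    AllocationCounterSketch sketch = new AllocationCounterSketch();
    sketch.addAllocation("node-a", "container-1");
    sketch.addAllocation("node-a", "container-2");
    sketch.addAllocation("node-b", "container-3");
    System.out.println(sketch.getTotalAllocated() == sketch.recountByIteration());
  }
}
```

This sketch is not thread-safe; a concurrent caller would need synchronization or an atomic counter, which is left out here to keep the idea visible.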


was (Author: bibinchundatt):
[~abmodi]

A few minor nits:

# NodeQueueLoadMonitor: the following initialization is not required, since the value is 
already set in the constructor.
{code}
  private int numNodesForAnyAllocation =
      DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
# EnrichedResourceRequest: rename the methods, since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can 
you maintain a metric to avoid iterating through the allocations for each 
scheduler key?
{noformat}
for (List allocs : allocations.values()) {
  totalAllocated += allocs.size();
}
{noformat}

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, 
> YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 






[jira] [Commented] (YARN-9697) Efficient allocation of Opportunistic containers.

2019-11-10 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971342#comment-16971342
 ] 

Bibin Chundatt commented on YARN-9697:
--

[~abmodi]

A few minor nits:

# NodeQueueLoadMonitor: the following initialization is not required, since the value is 
already set in the constructor.
{code}
  private int numNodesForAnyAllocation =
      DEFAULT_OPP_CONTAINER_ALLOCATION_NODES_NUMBER_USED;
{code}
# EnrichedResourceRequest: rename the methods, since we are returning maps now.
Improvement:
# CentralizedOpportunisticContainerAllocator#allocatePerSchedulerKey: can 
you maintain a metric to avoid iterating through the allocations for each 
scheduler key?
{noformat}
for (List allocs : allocations.values()) {
  totalAllocated += allocs.size();
}
{noformat}

> Efficient allocation of Opportunistic containers.
> -
>
> Key: YARN-9697
> URL: https://issues.apache.org/jira/browse/YARN-9697
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9697.001.patch, YARN-9697.002.patch, 
> YARN-9697.003.patch, YARN-9697.004.patch, YARN-9697.005.patch, 
> YARN-9697.006.patch, YARN-9697.007.patch, YARN-9697.008.patch, 
> YARN-9697.ut.patch, YARN-9697.ut2.patch, YARN-9697.wip1.patch, 
> YARN-9697.wip2.patch
>
>
> In the current implementation, opportunistic containers are allocated based 
> on the number of queued opportunistic container information received in node 
> heartbeat. This information becomes stale as soon as more opportunistic 
> containers are allocated on that node.
> Allocation of opportunistic containers happens on the same heartbeat in which 
> AM asks for the containers. When multiple applications request for 
> Opportunistic containers, containers might get allocated on the same set of 
> nodes as already allocated containers on the node are not considered while 
> serving requests from different applications. This can lead to uneven 
> allocation of Opportunistic containers across the cluster leading to 
> increased queuing time 






[jira] [Comment Edited] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888
 ] 

Bibin Chundatt edited comment on YARN-9940 at 10/30/19 2:04 PM:


[~kailiu_dev]

Apologies, I thought this issue was a duplicate of YARN-8436 and that you had closed it 
on that basis.
The Fixed and Resolved states are set only if the change has gone into 3.2.0. 

If that is not the case, we have to keep the issue open.

Please refer to: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

Reopening the issue.

was (Author: bibinchundatt):
[~kailiu_dev]

Apologies, I thought this issue was a duplicate of YARN-8436 and that you had closed it 
on that basis.
The Fixed and Resolved states are set only if the change has gone into 3.2.0. If that is 
not the case, we have to keep the issue open.

Please refer to: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
Reopening the issue.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
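
A common way to trigger this exception is a comparator that reads values which another thread changes while the sort is running. Whether that is the cause in this report is an assumption; the attached patch may take a different approach. Under that assumption, one defensive pattern is to snapshot the sort key of every node first and sort the immutable snapshots. The Node class and its availableMb field below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SnapshotSortSketch {

  // Hypothetical node whose available memory may be changed by other threads.
  static final class Node {
    final String id;
    volatile long availableMb;

    Node(String id, long availableMb) {
      this.id = id;
      this.availableMb = availableMb;
    }
  }

  // Immutable pairing of a node with the key value read once before sorting.
  static final class Snapshot {
    final Node node;
    final long availableMb;

    Snapshot(Node node, long availableMb) {
      this.node = node;
      this.availableMb = availableMb;
    }
  }

  // Sort nodes by available memory, largest first, using keys frozen up front.
  static List<Node> sortByAvailableDescending(List<Node> nodes) {
    List<Snapshot> snapshots = new ArrayList<>(nodes.size());
    for (Node n : nodes) {
      snapshots.add(new Snapshot(n, n.availableMb));
    }
    // The comparator only sees frozen values, so its ordering stays consistent.
    snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.availableMb).reversed());
    List<Node> sorted = new ArrayList<>(snapshots.size());
    for (Snapshot s : snapshots) {
      sorted.add(s.node);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>();
    nodes.add(new Node("n1", 10));
    nodes.add(new Node("n2", 30));
    nodes.add(new Node("n3", 20));
    for (Node n : sortByAvailableDescending(nodes)) {
      System.out.println(n.id + " " + n.availableMb);
    }
  }
}
```

The resulting order reflects the values at snapshot time and can be slightly stale, which is usually acceptable for a scheduling heuristic and avoids crashing the scheduling thread.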






[jira] [Updated] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9940:
-
Fix Version/s: (was: 3.2.0)

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
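
The trace above shows TimSort rejecting the comparator used by Collections.sort in the continuous scheduling thread. A frequent cause of this exception is that the values the comparator reads change while the sort is still running; the report itself does not confirm that this is the cause here. The following is a minimal Java sketch of one defensive pattern under that assumption: freeze each node's sort key first, then sort on the frozen copy. The names SnapshotSort, Node, availableMb and sortBySnapshot are hypothetical and are not taken from the YARN code or from the attached patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class SnapshotSort {

    /** Hypothetical stand-in for a scheduler node whose free memory may change concurrently. */
    public static final class Node {
        public final String id;
        public final AtomicLong availableMb;

        public Node(String id, long availableMb) {
            this.id = id;
            this.availableMb = new AtomicLong(availableMb);
        }
    }

    /**
     * Returns a new list ordered by available memory, largest first.
     * Each node's value is read exactly once before sorting, so the
     * comparator sees a consistent ordering even if other threads keep
     * updating the nodes while the sort runs.
     */
    public static List<Node> sortBySnapshot(List<Node> nodes) {
        // Freeze the sort key of every node (keyed by object identity).
        Map<Node, Long> snapshot = new IdentityHashMap<>();
        for (Node n : nodes) {
            snapshot.put(n, n.availableMb.get());
        }
        // Sort a copy so the caller's list is left untouched.
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingLong((Node n) -> snapshot.get(n)).reversed());
        return sorted;
    }
}
```

The trade-off in this sketch is that the resulting order may be slightly stale by the time it is used, in exchange for a comparator that cannot contradict itself during the sort.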



[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-9940:
--

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-30 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962888#comment-16962888
 ] 

Bibin Chundatt commented on YARN-9940:
--

[~kailiu_dev]

Apologies, I thought the issue was a duplicate of YARN-8436 and that you had 
closed it for that reason.
An issue should be marked Fixed and Resolved only if the changes have gone 
into 3.2.0. If that is not the case, we have to keep the issue open.

Please refer to: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
Reopening the issue.



> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-29 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt resolved YARN-9940.
--
Target Version/s:   (was: 2.7.2)
  Resolution: Duplicate

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



[jira] [Reopened] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-10-29 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt reopened YARN-9940:
--

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: 0001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033
 ] 

Bibin Chundatt edited comment on YARN-2442 at 10/28/19 8:32 PM:


Thank you [~cyrusjackson25] for the updated patch.

Overall the patch looks good to me. +1 for YARN-2442.004.patch. I will wait 
a day for others to take a look.

cc: [~rohithsharma]




was (Author: bibinchundatt):
Thank you [~cyrusjackson25] for the updated patch.

Overall the patch looks good to me. +1 for YARN-2442.003.patch. I will wait 
a day for others to take a look.

cc: [~rohithsharma]



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.004.patch, YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-28 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961033#comment-16961033
 ] 

Bibin Chundatt commented on YARN-2442:
--

Thank you [~cyrusjackson25] for the updated patch.

Overall the patch looks good to me. +1 for YARN-2442.003.patch. I will wait 
a day for others to take a look.

cc: [~rohithsharma]



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.004.patch, YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



[jira] [Updated] (YARN-9935) SSLHandshakeException thrown when HTTPS is enabled in AM web server in one certain condition

2019-10-26 Thread Bibin Chundatt (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin Chundatt updated YARN-9935:
-
Component/s: (was: amrmproxy)

> SSLHandshakeException thrown when HTTPS is enabled in AM web server in one 
> certain condition
> 
>
> Key: YARN-9935
> URL: https://issues.apache.org/jira/browse/YARN-9935
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sushanta Sen
>Priority: Major
>
> 【Precondition】:
> 1. Install the cluster
> 2. *WebAppProxyServer service installed in 1 VM and RMs installed in 2 VMs*
> 3. Enable all the required HTTPS configuration:
>    yarn.resourcemanager.application-https.policy = STRICT
>    yarn.app.mapreduce.am.webapp.https.enabled = true
>    yarn.app.mapreduce.am.webapp.https.client.auth = true
> 4. RM HA enabled
> 5. *Active RM is running in VM2, standby in VM1*
> 6. Cluster should be up and running
> 【Test step】:
> 1. Submit an application
> 2. Open the Application Master link for the application ID from the RM UI
> 【Expected Output】:
> No error should be thrown and the job should be successful
> 【Actual Output】:
> SSLHandshakeException is thrown, although the job is successful.
> "javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target"



[jira] [Comment Edited] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644
 ] 

Bibin Chundatt edited comment on YARN-2442 at 10/23/19 8:16 AM:


Thank you [~cyrusjackson25] for working on the patch.

# Currently RMInfo holds a reference to RMContext, which could lead to a 
memory leak on switch-over. Instead, we could use the ResourceManager instance 
directly.
# Fix the checkstyle issues.
# The Findbugs issue seems to be fixed already.
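
To illustrate the first point, here is a minimal Java sketch, not the actual patch. It assumes that the long-lived manager object replaces its context on every HA switch-over, so an info bean that caches the context would pin a stale object, while a bean that keeps only the manager always reads the current one. All names here (RmInfoSketch, HaState, Context, Manager, Info, switchTo) are hypothetical stand-ins and are not Hadoop classes or methods.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RmInfoSketch {

    /** HA states as listed in the issue description. */
    public enum HaState { INITIALIZING, ACTIVE, STANDBY, STOPPED }

    /** Hypothetical stand-in for the context object that is rebuilt on switch-over. */
    public static final class Context {
        private final HaState state;

        public Context(HaState state) {
            this.state = state;
        }

        public HaState getHaState() {
            return state;
        }
    }

    /** Hypothetical stand-in for the long-lived manager that owns the current context. */
    public static final class Manager {
        private final AtomicReference<Context> context =
            new AtomicReference<>(new Context(HaState.INITIALIZING));

        public Context getContext() {
            return context.get();
        }

        /** Simulates a switch-over: the old context is dropped and a new one installed. */
        public void switchTo(HaState state) {
            context.set(new Context(state));
        }
    }

    /** Info bean that stores only the manager and looks the context up on every call. */
    public static final class Info {
        private final Manager manager;

        public Info(Manager manager) {
            this.manager = manager;
        }

        public String getState() {
            // No context is cached here, so an old one can be garbage collected.
            return manager.getContext().getHaState().name();
        }
    }
}
```

Because Info never stores a Context, each call to getState() reflects the most recent switch-over and no replaced context stays reachable through the bean.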




was (Author: bibinchundatt):
Thank you [~cyrusjackson25] for working on the patch.

Currently RMInfo holds a reference to RMContext, which could lead to a 
memory leak on switch-over. Instead, we could use the ResourceManager object 
directly.



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)



[jira] [Commented] (YARN-2442) ResourceManager JMX UI does not give HA State

2019-10-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957644#comment-16957644
 ] 

Bibin Chundatt commented on YARN-2442:
--

Thank you [~cyrusjackson25] for working on the patch.

Currently RMInfo holds a reference to RMContext, which could lead to a 
memory leak on switch-over. Instead, we could use the ResourceManager object 
directly.



> ResourceManager JMX UI does not give HA State
> -
>
> Key: YARN-2442
> URL: https://issues.apache.org/jira/browse/YARN-2442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.5.0, 2.6.0, 2.7.0
>Reporter: Nishan Shetty
>Assignee: Rohith Sharma K S
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-2442.patch, YARN-2442.003.patch, 
> YARN-2442.02.patch
>
>
> ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
> STOPPED)


