[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2019-10-15 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952049#comment-16952049
 ] 

Tao Yang commented on YARN-8737:


Thanks [~cheersyang] for the review. Submitted already.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API; in the RM, 
> the parent queue was refreshing its child queues via ParentQueue#reinitialize 
> while, at the same time, an async-scheduling thread was sorting the child 
> queues in ParentQueue#sortAndGetChildrenAllocationIterator. A race condition 
> may occur and throw the following exception, because TimSort does not handle 
> concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock is already held when updating the child queues in 
> ParentQueue#reinitialize.
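
For illustration, a minimal, self-contained sketch of the locking pattern described above. The class and field names only mirror ParentQueue (child queues are reduced to strings); this is not the actual patch, which guards the real childQueues list that reinitialize() rewrites under the write lock.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified model of the proposed fix: sorting takes the read lock so it
// cannot interleave with reinitialize(), which rewrites the list under the
// write lock. TimSort then always sees a stable snapshot of the children.
public class ParentQueueSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> childQueues = new ArrayList<>();

  public void reinitialize(List<String> newChildQueues) {
    lock.writeLock().lock();
    try {
      childQueues.clear();
      childQueues.addAll(newChildQueues);
    } finally {
      lock.writeLock().unlock();
    }
  }

  public Iterator<String> sortAndGetChildrenAllocationIterator() {
    lock.readLock().lock();
    try {
      // Copy while holding the read lock, then sort the private copy.
      List<String> snapshot = new ArrayList<>(childQueues);
      Collections.sort(snapshot);
      return snapshot.iterator();
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}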



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2019-10-14 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951552#comment-16951552
 ] 

Tao Yang commented on YARN-8737:


Thanks [~Amithsha] for the feedback. Sorry for having forgotten this issue for 
so long.

[~cheersyang] & [~sunilg], could you please help review the patch?

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API; in the RM, 
> the parent queue was refreshing its child queues via ParentQueue#reinitialize 
> while, at the same time, an async-scheduling thread was sorting the child 
> queues in ParentQueue#sortAndGetChildrenAllocationIterator. A race condition 
> may occur and throw the following exception, because TimSort does not handle 
> concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock is already held when updating the child queues in 
> ParentQueue#reinitialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage

2019-10-13 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671
 ] 

Tao Yang edited comment on YARN-9838 at 10/14/19 3:17 AM:
--

Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. One last small suggestion is to add a blank line before 
the new test case, which I can update directly before committing.

I will commit this in a few days if there are no further comments from others.


was (Author: tao yang):
Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. One last small suggestion is to add a blank line before 
the new test case.

I will commit this in a few days if there are no further comments from others.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    
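
For illustration only, a minimal sketch of the balanced bookkeeping the description above says is missing for reserved containers when an app is moved: the $FROM queue is decremented and the $TO queue incremented by the same amounts. The class and method below are hypothetical and deliberately simplified; they are not the actual patch or the real AbstractCSQueue API.

{code}
// Hypothetical, simplified bookkeeping: moving a reserved container should be
// "conservative", i.e. what is subtracted from the source queue is added to
// the target queue, so numContainer and used resources stay consistent.
public class QueueUsageSketch {
  static class QueueUsage {
    int numContainers;
    long usedMemoryMB;
  }

  static void moveReservedContainer(QueueUsage from, QueueUsage to,
      long reservedMemoryMB) {
    // Release from the source queue...
    from.numContainers--;
    from.usedMemoryMB -= reservedMemoryMB;
    // ...and allocate the same amount in the target queue.
    to.numContainers++;
    to.usedMemoryMB += reservedMemoryMB;
  }

  public static void main(String[] args) {
    QueueUsage a = new QueueUsage();
    QueueUsage b = new QueueUsage();
    a.numContainers = 1;
    a.usedMemoryMB = 2048;
    moveReservedContainer(a, b, 2048);
    // a is now empty, b holds the reservation; nothing leaks in either queue.
    System.out.println(a.numContainers + " containers left in a, "
        + b.usedMemoryMB + " MB used in b");
  }
}
{code}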



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-10-13 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671
 ] 

Tao Yang commented on YARN-9838:


Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. One last small suggestion is to add a blank line before 
the new test case.

I will commit this in a few days if there are no further comments from others.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage

2019-10-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330
 ] 

Tao Yang edited comment on YARN-9838 at 10/11/19 10:02 AM:
---

Thanks [~jiulongZhu] for fixing this issue. 
The patch LGTM in general; some minor suggestions:
* Check-style warnings need to be fixed; after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of the updated log needs to be adjusted, and the unnecessary 
deletion of a blank line in LeafQueue should be reverted.
* The comment "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add such details 
here.
* As for the UT, you can remove the before-fix block and just keep the correct 
verification. Moreover, I think it's better to remove the method comment 
("//YARN-9838") since we can find the source easily via git; the "/\*\* \*/" 
comment style is usually used for classes or methods, so it's better to use 
"//" or "/\* \*/" inside the method.


was (Author: tao yang):
Thanks [~jiulongZhu] for fixing this issue. 
The patch LGTM in general; some minor suggestions:
* Check-style warnings need to be fixed; after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of the updated log needs to be adjusted, and the unnecessary 
deletion of a blank line in LeafQueue should be reverted.
* The comment "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add such details 
here.
* As for the UT, you can remove the before-fix block and just keep the correct 
verification. Moreover, I think it's better to remove "//YARN-9838" since we 
can find the source easily via git; the "/** */" comment style is usually used 
for classes or methods, so it's better to use "//" or "/* */" inside the 
method.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics

2019-10-11 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Issue Type: Bug  (was: Improvement)

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics

2019-10-11 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Fix Version/s: (was: 2.7.3)

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-10-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330
 ] 

Tao Yang commented on YARN-9838:


Thanks [~jiulongZhu] for fixing this issue. 
The patch LGTM in general; some minor suggestions:
* Check-style warnings need to be fixed; after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of the updated log needs to be adjusted, and the unnecessary 
deletion of a blank line in LeafQueue should be reverted.
* The comment "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add such details 
here.
* As for the UT, you can remove the before-fix block and just keep the correct 
verification. Moreover, I think it's better to remove "//YARN-9838" since we 
can find the source easily via git; the "/** */" comment style is usually used 
for classes or methods, so it's better to use "//" or "/* */" inside the 
method.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.3
>
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we see "Used Resource", "Used Capacity", 
> "Absolute Used Capacity" and "Num Container" become positive or negative 
> while the queue is completely idle (no RUNNING, no NEW apps...). In extreme 
> cases, apps couldn't be submitted to a queue that is actually idle because 
> its "Used Resource" is far above zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the state of "numContainer" and 
> "Used". Second, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics are changed (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers did not modify the "numContainer" value in 
> AbstractCSQueue or the "used" value in ResourceUsage when the application 
> was moved from one queue to another.
>         The table below shows how the metrics change when a 
> reservedContainer is allocated, moved from the $FROM queue to the $TO 
> queue, and released. The increases and decreases do not balance out: the 
> resource is allocated in the $FROM queue but released in the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       In contrast, the metric changes for allocatedContainers (allocated, 
> acquired, running) across allocate, movetoqueue and release are fully 
> balanced.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672
 ] 

Tao Yang edited comment on YARN-8995 at 9/7/19 12:33 AM:
-

Thanks [~jhung] for fixing this problem, and sorry for missing the logger class 
changes in branch-3.1 and branch-3.2.
The failures in the Jenkins report are caused by the running environment and are 
unrelated to the patch.
The patch LGTM and has already been tested in my local environment. Committing 
shortly.


was (Author: tao yang):
Thanks [~jhung] for fixing this problem, and sorry for missing the logger class 
changes in branch-3.1.
The patch LGTM and has already been tested in my local environment. Committing 
shortly.

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: TestStreamPerf.java, 
> YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672
 ] 

Tao Yang commented on YARN-8995:


Thanks [~jhung] for fixing this problem, and sorry for missing the logger class 
changes in branch-3.1.
The patch LGTM and has already been tested in my local environment. Committing 
shortly.

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: TestStreamPerf.java, 
> YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9817) Fix failing testcases due to not initialized AsyncDispatcher - ArithmeticException: / by zero

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924659#comment-16924659
 ] 

Tao Yang commented on YARN-9817:


Thanks [~Prabhu Joseph] for raising this issue. 
Patch LGTM, committing now...

> Fix failing testcases due to not initialized AsyncDispatcher -  
> ArithmeticException: / by zero
> --
>
> Key: YARN-9817
> URL: https://issues.apache.org/jira/browse/YARN-9817
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.3.0, 3.2.1, 3.1.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9817-001.patch
>
>
> The testcases below fail because AsyncDispatcher throws ArithmeticException: 
> / by zero
> {code}
>  hadoop.mapreduce.v2.app.TestRuntimeEstimators 
>  hadoop.mapreduce.v2.app.job.impl.TestJobImpl 
>  hadoop.mapreduce.v2.app.TestMRApp 
> {code}
> Error Message:
> {code}
> [ERROR] testUpdatedNodes(org.apache.hadoop.mapreduce.v2.app.TestMRApp)  Time 
> elapsed: 0.847 s  <<< ERROR!
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1015)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:141)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1544)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1263)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:301)
>   at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:285)
>   at 
> org.apache.hadoop.mapreduce.v2.app.TestMRApp.testUpdatedNodes(TestMRApp.java:223)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> This happens when AsyncDispatcher is not initialized in the testcases, so 
> detailsInterval is taken as 0.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-05 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923891#comment-16923891
 ] 

Tao Yang commented on YARN-9795:


+1 for the latest patch.
I will commit this if there are no further comments from others.

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch, YARN-9795.002.patch, 
> YARN-9795.003.patch, YARN-9795.004.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-05 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923882#comment-16923882
 ] 

Tao Yang commented on YARN-9795:


Thanks [~fengnanli] for the update. A small suggestion is to remove the null 
initial value for aMContainerAllocationDelay since it seems redundant. Does that 
make sense?

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch, YARN-9795.002.patch, 
> YARN-9795.003.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923024#comment-16923024
 ] 

Tao Yang commented on YARN-9795:


Thanks [~fengnanli] for this improvement.
The patch almost LGTM. IMO, there's no need to set -1 as the initial value of 
scheduledTime and add the special comment; 0 should be the proper initial value, 
like the other times. The new check-style warnings should be fixed as well.

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922996#comment-16922996
 ] 

Tao Yang commented on YARN-8995:


Hi [~zhuqi], I found another place that needs to be improved: {{if (qSize % 
detailsInterval == 0)}} should be updated to {{if (qSize != 0 && qSize % 
detailsInterval == 0 && lastEventDetailsQueueSizeLogged != qSize)}}, to avoid 
printing for an empty queue and printing the same details redundantly.
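
A minimal, self-contained sketch of the suggested condition in context; the field names follow the comment above, but the surrounding handler is simplified and is not the actual AsyncDispatcher code.

{code}
// Simplified sketch of the suggested guard in an AsyncDispatcher-like handler:
// skip an empty queue and avoid logging the same queue size twice in a row.
public class EventQueueLogGuardSketch {
  private int lastEventDetailsQueueSizeLogged = -1;
  private final int detailsInterval = 1000; // assumed configurable threshold

  void maybeLogEventDetails(int qSize) {
    if (qSize != 0 && qSize % detailsInterval == 0
        && lastEventDetailsQueueSizeLogged != qSize) {
      lastEventDetailsQueueSizeLogged = qSize;
      System.out.println("Event queue size is " + qSize
          + ", dumping event type details...");
    }
  }

  public static void main(String[] args) {
    EventQueueLogGuardSketch sketch = new EventQueueLogGuardSketch();
    sketch.maybeLogEventDetails(0);    // empty queue: nothing printed
    sketch.maybeLogEventDetails(2000); // printed once
    sketch.maybeLogEventDetails(2000); // same size again: skipped
  }
}
{code}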

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922279#comment-16922279
 ] 

Tao Yang commented on YARN-8995:


Confirmed that the latest patch should not fail like that.
The patch now LGTM; waiting for feedback from [~cheersyang], thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921981#comment-16921981
 ] 

Tao Yang commented on YARN-8995:


Hi [~zhuqi], I noticed that TestAsyncDispatcher#testPrintDispatcherEventDetails, 
which was added by this patch, failed 2 days ago. Can you confirm why this 
happened? Even though it didn't happen again, I'm still afraid it may fail 
intermittently.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-01 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920568#comment-16920568
 ] 

Tao Yang commented on YARN-8995:


Thanks [~zhuqi] for the update.
The patch LGTM; could you please also fix the remaining check-style warnings?
Hi [~cheersyang], please help review again. Are these changes OK with you?

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to hurt cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, add that 
> information to the metrics, and make the queue-size threshold a configurable 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919658#comment-16919658
 ] 

Tao Yang commented on YARN-9540:


Thanks [~abmodi], [~adam.antal] for the review and commit.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919654#comment-16919654
 ] 

Tao Yang commented on YARN-9798:


Thanks [~abmodi] for the review. 
The failure frequency was only 1 or 2 in 2000 runs, and it didn't happen again 
after this fix.
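
For context, the issue description below proposes adding {{rm.drainEvents()}} before the assertion. A generic, self-contained sketch of that drain-before-assert pattern is shown here; it is not the actual MockRM test code, and the executor stands in for the async dispatcher.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Generic illustration of the fix pattern: drain the async dispatcher before
// asserting on a counter it updates, otherwise the assertion can race with
// the event that is still in flight (the intermittent <1> vs <0> failure).
public class DrainBeforeAssertSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    AtomicInteger unregisteredEvents = new AtomicInteger();

    dispatcher.submit(unregisteredEvents::incrementAndGet);

    // Equivalent of rm.drainEvents(): wait until queued events are handled.
    dispatcher.shutdown();
    dispatcher.awaitTermination(10, TimeUnit.SECONDS);

    // Now the assertion is deterministic.
    if (unregisteredEvents.get() != 1) {
      throw new AssertionError("Expecting only one event");
    }
    System.out.println("Expecting only one event: OK");
  }
}
{code}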

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of 
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the 
> YARN-9714 Jenkins report. The cause is that the assertion checks that the 
> dispatcher has handled the UNREGISTERED event but does not wait until all 
> events in the dispatcher are handled; we need to add {{rm.drainEvents()}} 
> before that assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> 

[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9798:
---
Attachment: (was: YARN-9798.001.patch)

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the
> YARN-9714 jenkins report. The cause is that the assertion verifying the
> dispatcher has handled the UNREGISTERED event runs before all events in the
> dispatcher have been handled. We need to add {{rm.drainEvents()}} before that
> assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
>   at 
> 

[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9798:
---
Attachment: YARN-9798.001.patch

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the
> YARN-9714 jenkins report. The cause is that the assertion verifying the
> dispatcher has handled the UNREGISTERED event runs before all events in the
> dispatcher have been handled. We need to add {{rm.drainEvents()}} before that
> assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
>   at 
> 

[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919204#comment-16919204
 ] 

Tao Yang commented on YARN-9714:


Thanks [~rohithsharma], [~bibinchundatt] for the review and commit!

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating the
> heap dump and jstack output, I found two places in RM that may cause memory leaks after
> the RM has transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer when
> the services are stopping.
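
The general shape of that cleanup is sketched below (illustrative only, not the committed patch; the field and hook names are assumptions used to show the idea):
{code}
// Sketch only: each service releases its long-lived resource when it stops.

// In AbstractYarnScheduler: cancel the release-cache cleanup timer on stop.
@Override
protected void serviceStop() throws Exception {
  if (releaseCacheTimer != null) {
    releaseCacheTimer.cancel();   // stop the Timer so its background thread can exit
  }
  super.serviceStop();
}

// In ZKRMStateStore: close the ZooKeeper connection when the store shuts down.
@Override
protected void closeInternal() throws Exception {
  if (zkClient != null) {
    zkClient.close();             // release the ZK client held by the state store
  }
}
{code}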



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9803) NPE while accessing Scheduler UI

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang resolved YARN-9803.

Resolution: Duplicate

Hi, [~yifan.stan]. This is a duplicate of YARN-9685, so I'm closing it as a duplicate.

> NPE while accessing Scheduler UI
> 
>
> Key: YARN-9803
> URL: https://issues.apache.org/jira/browse/YARN-9803
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Xie YiFan
>Assignee: Xie YiFan
>Priority: Major
> Attachments: YARN-9803-branch-3.1.1.001.patch
>
>
> The same as what is described in YARN-4624.
> Scenario:
>  ===
> If not every queue has its capacity configured for a node label (a value of 0 also
> counts as configured), starting the cluster and accessing the capacityscheduler page fails with:
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:342)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:513)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097
 ] 

Tao Yang edited comment on YARN-9540 at 8/30/19 2:00 AM:
-

Hi, [~adam.antal].
The cause is that the assertion verifying the dispatcher has handled the event
runs without waiting for the dispatcher first. We need to add
{{rmDispatcher.await()}} before that assertion, as other tests in TestRMAppTransitions do, to fix this issue.
In my local tests, about 5 or more failures happened in 1000 runs. After applying
the patch, I didn't see it again.
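
A minimal sketch of that pattern (illustrative only, not the committed patch; {{application}} and {{verifyAppCompletedEvent}} are names taken from the existing test, and the exact placement is an assumption):
{code}
// Sketch only: APP_COMPLETED is delivered asynchronously, so wait for the
// dispatcher to drain before counting the recorded events.
rmDispatcher.await();                  // NEW: added before the assertion
verifyAppCompletedEvent(application);  // previously raced with the dispatcher thread
{code}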


was (Author: tao yang):
Hi, [~adam.antal]. 
The cause is that the assertion which will make sure dispatcher have handled 
event but not wait, we need to add {{rmDispatcher.await()}} before that 
assertion like others in TestRMAppTransitions to fix this issue.
In my local test, about 5+ failures may happened in 1000 runs. After applying 
the patch, I didn't see it again.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> 

[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097
 ] 

Tao Yang commented on YARN-9540:


Hi, [~adam.antal].
The cause is that the assertion verifying the dispatcher has handled the event
runs without waiting for the dispatcher first. We need to add {{rmDispatcher.await()}} before that
assertion, as other tests in TestRMAppTransitions do, to fix this issue.
In my local tests, about 5 or more failures happened in 1000 runs. After applying
the patch, I didn't see it again.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira

[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918511#comment-16918511
 ] 

Tao Yang commented on YARN-9664:


Thanks [~cheersyang] for the review and commit!

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently some diagnostics are not easy for common users to understand,
> and I found several places that still need improvement, such as missing partition
> information and missing necessary activities. This issue is to address
> these shortcomings.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918510#comment-16918510
 ] 

Tao Yang commented on YARN-9538:


Thanks [~cheersyang] for reminding me, I will do that later.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9540:
---
Attachment: YARN-9540.001.patch

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang reassigned YARN-9540:
--

Assignee: Tao Yang  (was: Prabhu Joseph)

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918506#comment-16918506
 ] 

Tao Yang commented on YARN-9799:


Thanks [~Prabhu Joseph] for reminding me, I'll fix this issue over there.

> TestRMAppTransitions#testAppFinishedFinished fails intermittently
> -
>
> Key: YARN-9799
> URL: https://issues.apache.org/jira/browse/YARN-9799
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9799.001.patch
>
>
> Found an intermittent failure of TestRMAppTransitions#testAppFinishedFinished in the
> YARN-9664 jenkins report. The cause is that the assertion verifying the
> dispatcher has handled the APP_COMPLETED event runs without waiting for the dispatcher. We need to add
> {{rmDispatcher.await()}} before that assertion, as other tests in this class do, to
> fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918469#comment-16918469
 ] 

Tao Yang commented on YARN-9664:


Hi, [~cheersyang].
{quote}
UT seems not related to this patch, Tao Yang, could you please confirm?
{quote}
Yes, it's not related to this patch; I have created YARN-9799 to fix it. Thanks.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently some diagnostics are not easy for common users to understand,
> and I found several places that still need improvement, such as missing partition
> information and missing necessary activities. This issue is to address
> these shortcomings.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9799:
---
Attachment: YARN-9799.001.patch

> TestRMAppTransitions#testAppFinishedFinished fails intermittently
> -
>
> Key: YARN-9799
> URL: https://issues.apache.org/jira/browse/YARN-9799
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9799.001.patch
>
>
> Found an intermittent failure of TestRMAppTransitions#testAppFinishedFinished in the
> YARN-9664 jenkins report. The cause is that the assertion verifying the
> dispatcher has handled the APP_COMPLETED event runs without waiting for the dispatcher. We need to add
> {{rmDispatcher.await()}} before that assertion, as other tests in this class do, to
> fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently

2019-08-29 Thread Tao Yang (Jira)
Tao Yang created YARN-9799:
--

 Summary: TestRMAppTransitions#testAppFinishedFinished fails 
intermittently
 Key: YARN-9799
 URL: https://issues.apache.org/jira/browse/YARN-9799
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Tao Yang
Assignee: Tao Yang


Found an intermittent failure of TestRMAppTransitions#testAppFinishedFinished in the
YARN-9664 jenkins report. The cause is that the assertion verifying the dispatcher
has handled the APP_COMPLETED event runs without waiting for the dispatcher. We need to add
{{rmDispatcher.await()}} before that assertion, as other tests in this class do, to fix
this issue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918446#comment-16918446
 ] 

Tao Yang commented on YARN-9714:


There is an intermittent UT failure in the latest jenkins report; I have
created YARN-9798 to fix it.

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating the
> heap dump and jstack output, I found two places in RM that may cause memory leaks after
> the RM has transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer when
> the services are stopping.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9798:
---
Attachment: YARN-9798.001.patch

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the
> YARN-9714 jenkins report. The cause is that the assertion verifying the
> dispatcher has handled the UNREGISTERED event runs before all events in the
> dispatcher have been handled. We need to add {{rm.drainEvents()}} before that
> assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
>   at 
> 

[jira] [Created] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-29 Thread Tao Yang (Jira)
Tao Yang created YARN-9798:
--

 Summary: 
ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
intermittently
 Key: YARN-9798
 URL: https://issues.apache.org/jira/browse/YARN-9798
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Tao Yang
Assignee: Tao Yang


Found an intermittent failure of
ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the
YARN-9714 jenkins report. The cause is that the assertion verifying the
dispatcher has handled the UNREGISTERED event runs before all events in the
dispatcher have been handled. We need to add {{rm.drainEvents()}} before that
assertion to fix this issue.

Failure info:
{noformat}
[ERROR] 
testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
  Time elapsed: 0.559 s  <<< FAILURE!
java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{noformat}
Standard output:
{noformat}
2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error in 
handling event type REGISTERED for applicationAttempt 
appattempt_1567061994047_0001_01
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
at 
org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
at 
org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:76)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at 

[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918321#comment-16918321
 ] 

Tao Yang commented on YARN-9664:


Thanks [~cheersyang] for the advice. Attached v3 patch.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently some diagnostics are not easy for common users to understand,
> and I found several places that still need improvement, such as missing partition
> information and missing necessary activities. This issue is to address
> these shortcomings.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: YARN-9664.003.patch

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently some diagnostics are not easy for common users to understand,
> and I found several places that still need improvement, such as missing partition
> information and missing necessary activities. This issue is to address
> these shortcomings.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918273#comment-16918273
 ] 

Tao Yang commented on YARN-9714:


Hi, [~rohithsharma]. The UT log is filled with these errors:
"java.lang.OutOfMemoryError: unable to create new native thread"; perhaps
threads were exhausted at that time on one of the jenkins nodes. Could you please
tell me how to retrigger jenkins without updating the patch or its status?

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating the
> heap dump and jstack output, I found two places in RM that may cause memory leaks after
> the RM has transitioned to standby:
>  # The release-cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer when
> the services are stopping.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918247#comment-16918247
 ] 

Tao Yang commented on YARN-9664:


Thanks [~cheersyang] for the review.
{quote}
ActivitiesUtils  Line 56: I noticed that the 1st filter is to filter out null 
objects
{quote}
That filter is aimed at node-level activities rather than null objects: we use 
{{e.getNodeId() != null}} because only node-level activities have non-null 
nodeIds.
{quote}
what does "single placement node" mean here?
{quote}
"single placement node" means this scheduling process is based on a single 
node, I want to use it to distinguish from multi-nodes placement scenarios, 
however it seems not suitable, I'll be glad if you have better description for 
it.
{quote}
"Node skipped because of no off-switch and locality violation"
I am also not quite sure what does this mean, can you please elaborate?
{quote}
It means the request has only node_local or rack_local request types but no 
off-switch type, and the node/rack locality can't be satisfied.
{quote}
line 650: is it safe to the check: "if (node != null && !isReserved)" here?
{quote}
I think there is no need to add that check. No matter whether the node is null 
or what type the assignment is, any required activities should already be 
finished by the time execution reaches this point.
The other comments are fine with me; I will update the patch once all the points 
above are confirmed. Thanks.
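
For the ActivitiesUtils point, here is a minimal standalone sketch of that kind of 
filter for illustration only. The class and helper names are stand-ins, not the 
actual ActivitiesUtils code; only the {{e.getNodeId() != null}} predicate comes from 
the discussion above.
{noformat}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Only node-level activity entries carry a non-null nodeId, so a single
// getNodeId() != null check is enough to tell them apart from request-level entries.
class ActivityFilterSketch {
  static class Activity {
    private final String nodeId;
    private final String diagnostic;
    Activity(String nodeId, String diagnostic) { this.nodeId = nodeId; this.diagnostic = diagnostic; }
    String getNodeId() { return nodeId; }
    @Override public String toString() { return nodeId + ": " + diagnostic; }
  }

  static List<Activity> nodeLevelOnly(List<Activity> activities) {
    return activities.stream()
        .filter(e -> e.getNodeId() != null)   // keeps node-level entries only
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Activity> all = Arrays.asList(
        new Activity("host-1:8041", "Node skipped"),   // node-level entry
        new Activity(null, "Request skipped"));        // request-level entry
    System.out.println(nodeLevelOnly(all));            // prints only the node-level entry
  }
}
{noformat}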

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently some diagnostics are not easy enough to understand for common 
> users, and I found some places still need to be improved such as no partition 
> information and lacking of necessary activities. This issue is to improve 
> these shortcomings.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917516#comment-16917516
 ] 

Tao Yang commented on YARN-9664:


Hi, [~cheersyang]. It does change a lot, and most of the changes are state/info 
improvements. I think the output of most of these changes is as expected, but some 
may still need to be improved; please feel free to give your advice. Thanks.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently some diagnostics are not easy enough to understand for common 
> users, and I found some places still need to be improved such as no partition 
> information and lacking of necessary activities. This issue is to improve 
> these shortcomings.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917507#comment-16917507
 ] 

Tao Yang commented on YARN-8995:


Hi, [~zhuqi]. The latest patch no longer applies to trunk; could you 
please rebase and update it?
The latest patch also has two places that need to be updated or confirmed:
1. The prefix of YARN_DISPATCHER_PRINT_EVENTS_INFO_THRESHOLD is "yarn.yarn.".
2. Why is this change needed: LOG.fatal("Error in dispatcher thread", t) --> 
LOG.error(FATAL, "Error in dispatcher thread", t)?
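
For the first point, a standalone illustration of how a doubled "yarn.yarn." prefix 
typically arises when the key literal already contains the prefix. The key string and 
constant shapes below are assumptions for illustration, not a quote of the actual 
patch code.
{noformat}
// Assumed illustration of the "yarn.yarn." prefix issue, not the actual patch code.
class PrefixSketch {
  static final String YARN_PREFIX = "yarn.";

  // Buggy shape: the literal already starts with "yarn.", so the resulting key
  // becomes "yarn.yarn.dispatcher.print-events-info.threshold".
  static final String DOUBLED =
      YARN_PREFIX + "yarn.dispatcher.print-events-info.threshold";

  // Intended shape: concatenate the prefix with the un-prefixed remainder only.
  static final String EXPECTED =
      YARN_PREFIX + "dispatcher.print-events-info.threshold";

  public static void main(String[] args) {
    System.out.println(DOUBLED);    // yarn.yarn.dispatcher.print-events-info.threshold
    System.out.println(EXPECTED);   // yarn.dispatcher.print-events-info.threshold
  }
}
{noformat}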

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch
>
>
> In our growing cluster,there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of  
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, and add the information 
> to the metrics, and the threshold of queue size is a parametor which can be 
> changed.






[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-28 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Attachment: YARN-9714.005.patch

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917482#comment-16917482
 ] 

Tao Yang commented on YARN-9714:


{quote}
Instead of comparing, how about checking for resourceManager.getZKManager() == 
null? This basically sync the code where zkManager initialization to closing it.
{quote}
Makes sense to me. Attached the v5 patch for this, thanks!

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Attachment: YARN-9714.004.patch

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916291#comment-16916291
 ] 

Tao Yang commented on YARN-8917:


Thanks [~rohithsharma], [~leftnoteasy], [~sunilg] for the review and commit!

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be equal to multiply capacity by parent-queue's 
> absolute-capacity,
> but currently it's calculated as dividing capacity by parent-queue's 
> absolute-capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916290#comment-16916290
 ] 

Tao Yang commented on YARN-9714:


The TestZKRMStateStore#testZKRootPathAcls UT failure is caused by the test itself: 
the stateStore (ZKRMStateStore instance) used for verification is not refreshed after 
the RM HA transition. I will attach a v4 patch to fix this UT problem.
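
A standalone sketch of the stale-reference problem described above, using stub types 
rather than the real TestZKRMStateStore/ResourceManager classes: a store reference 
captured before an HA transition points at the old active services, so the test has 
to re-read it afterwards.
{noformat}
// Stub-based sketch only; the real test works against MiniRM/ZKRMStateStore instances.
class StaleStoreSketch {
  static class StateStore { }

  static class Rm {
    private StateStore store = new StateStore();
    void transitionToStandbyAndBackToActive() {
      store = new StateStore();       // new active services create a new state store
    }
    StateStore getStateStore() { return store; }
  }

  public static void main(String[] args) {
    Rm rm = new Rm();
    StateStore before = rm.getStateStore();   // captured before the HA transition
    rm.transitionToStandbyAndBackToActive();
    StateStore after = rm.getStateStore();    // must be re-fetched for verification
    // Prints false: verifying ACLs against 'before' would check a stale store.
    System.out.println(before == after);
  }
}
{noformat}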

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915685#comment-16915685
 ] 

Tao Yang commented on YARN-8193:


Hi, [~sunilg], [~leftnoteasy]. 
Any updates or plans for this fix on branch-2.x? YARN-9779 seems to be the 
same issue.

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point RM just hangs and 
> stops allocating resources. At the point RM get hangs, YARN throw 
> NullPointerException  at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  






[jira] [Commented] (YARN-9779) NPE while allocating a container

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915684#comment-16915684
 ] 

Tao Yang commented on YARN-9779:


Sorry for the late reply. I think this issue is a duplicate of YARN-8193.

> NPE while allocating a container
> 
>
> Key: YARN-9779
> URL: https://issues.apache.org/jira/browse/YARN-9779
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Getting the following exception while allocating a container 
>  
> 2019-08-22 23:59:20,180 FATAL event.EventDispatcher (?:?(?)) - Error in 
> handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:301)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1346)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1341)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
>  at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-08-22 23:59:20,180 INFO  rmcontainer.RMContainerImpl (?:?(?)) - 
> container_e2364_1565770624228_198773_01_000946 Container Transitioned from 
> ALLOCATED to ACQUIRED
> 2019-08-22 23:59:20,180 INFO  event.EventDispatcher (?:?(?)) - Exiting, bbye..






[jira] [Issue Comment Deleted] (YARN-9779) NPE while allocating a container

2019-08-26 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9779:
---
Comment: was deleted

(was: Sorry for the late reply. I think this issue is a duplicate of YARN-8193.)

> NPE while allocating a container
> 
>
> Key: YARN-9779
> URL: https://issues.apache.org/jira/browse/YARN-9779
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Getting the following exception while allocating a container 
>  
> 2019-08-22 23:59:20,180 FATAL event.EventDispatcher (?:?(?)) - Error in 
> handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:301)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1346)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1341)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
>  at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-08-22 23:59:20,180 INFO  rmcontainer.RMContainerImpl (?:?(?)) - 
> container_e2364_1565770624228_198773_01_000946 Container Transitioned from 
> ALLOCATED to ACQUIRED
> 2019-08-22 23:59:20,180 INFO  event.EventDispatcher (?:?(?)) - Exiting, bbye..






[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Attachment: YARN-9714.003.patch

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915628#comment-16915628
 ] 

Tao Yang commented on YARN-8917:


Hi, [~rohithsharma]. I can't reproduce this failure on the trunk branch in my local 
environment. According to the standard output, the assignment that should have 
allocated a container was broken for an unknown reason after the node check, and the 
debug info is not enough to locate the cause. Moreover, this patch only affects the 
capacity shown in the WebUI or REST results and won't affect the scheduling process, 
so I think the two are not related.

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be equal to multiply capacity by parent-queue's 
> absolute-capacity,
> but currently it's calculated as dividing capacity by parent-queue's 
> absolute-capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915625#comment-16915625
 ] 

Tao Yang commented on YARN-9714:


We can close the connection after checking {{zkManager != 
resourceManager.getZKManager()}}:
when resourceManager.getZKManager() is null, the check also guarantees that 
zkManager is not null before closing;
when resourceManager.getZKManager() is not null and the two are not the same 
instance, the connection is closed as well.
I think this is a more direct way; does that make sense?
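
A standalone sketch of that ownership check with simplified stand-in classes (not the 
actual ZKRMStateStore/ResourceManager code): the store closes its ZooKeeper manager 
only when it is not the shared instance held by the RM for HA.
{noformat}
// Simplified stand-ins; the real classes are ZKRMStateStore, ResourceManager and ZKCuratorManager.
class ZkCloseSketch {
  static class ZkManager {
    void close() { System.out.println("ZooKeeper connection closed"); }
  }

  static class Rm {
    ZkManager sharedZkManager;                    // non-null when the Curator-based elector owns it
    ZkManager getZKManager() { return sharedZkManager; }
  }

  ZkManager zkManager;                            // the manager the state store is actually using
  Rm resourceManager;

  void closeInternal() {
    // Close only a connection the store created for itself; leave the shared HA one alone.
    if (zkManager != null && zkManager != resourceManager.getZKManager()) {
      zkManager.close();
    }
  }

  public static void main(String[] args) {
    ZkCloseSketch store = new ZkCloseSketch();
    store.resourceManager = new Rm();             // no shared manager: the store owns its connection
    store.zkManager = new ZkManager();
    store.closeInternal();                        // prints "ZooKeeper connection closed"
  }
}
{noformat}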

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Comment Edited] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915598#comment-16915598
 ] 

Tao Yang edited comment on YARN-9714 at 8/26/19 8:45 AM:
-

Hi, [~rohithsharma].
I have commented (over 
[here|https://issues.apache.org/jira/browse/YARN-9714?focusedCommentId=16896704=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16896704])
 for this: "As for zkManager in ZKStateStore, it will reuse zkManager for HA 
when RM uses the Curator-based elector for leader election, otherwise it will 
be created for ZKRMStateStore".  Please refer to 
{{ResourceManager#createEmbeddedElector}} and {{ZKRMStateStore#initInternal}} 
for details.


was (Author: tao yang):
Hi, [~rohithsharma].

I have commented (over here) for this: "As for zkManager in ZKStateStore, it 
will reuse zkManager for HA when RM uses the Curator-based elector for leader 
election, otherwise it will be created for ZKRMStateStore".  Please refer to 
{{ResourceManager#createEmbeddedElector}} and {{ZKRMStateStore#initInternal}} 
for details.

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Comment Edited] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915598#comment-16915598
 ] 

Tao Yang edited comment on YARN-9714 at 8/26/19 8:43 AM:
-

Hi, [~rohithsharma].

I have commented (over here) for this: "As for zkManager in ZKStateStore, it 
will reuse zkManager for HA when RM uses the Curator-based elector for leader 
election, otherwise it will be created for ZKRMStateStore".  Please refer to 
{{ResourceManager#createEmbeddedElector}} and {{ZKRMStateStore#initInternal}} 
for details.


was (Author: tao yang):
Hi, [~rohithsharma].

I have commented (over 
[here|https://issues.apache.org/jira/browse/YARN-9714?focusedCommentId=16896704=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16896704])
 for this: "As for zkManager in ZKStateStore, it will reuse zkManager for HA 
when RM uses the Curator-based elector for leader election, otherwise it will 
be created for ZKRMStateStore". 

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915598#comment-16915598
 ] 

Tao Yang commented on YARN-9714:


Hi, [~rohithsharma].

I have commented (over 
[here|https://issues.apache.org/jira/browse/YARN-9714?focusedCommentId=16896704=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16896704])
 for this: "As for zkManager in ZKStateStore, it will reuse zkManager for HA 
when RM uses the Curator-based elector for leader election, otherwise it will 
be created for ZKRMStateStore". 

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915522#comment-16915522
 ] 

Tao Yang commented on YARN-9714:


Thanks [~rohithsharma] for the comment. We are using hadoop-3.1 right now. 
IIUC, the ZK connection can be closed in ZKRMStateStore#closeInternal only for 
non-HA mode, but we found this problem when HA is enabled, which is also described 
in the title ("... leaks after RM transitioned to standby"). I will move the fix 
logic into ZKRMStateStore#closeInternal. Does that make sense?

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2019-08-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915511#comment-16915511
 ] 

Tao Yang commented on YARN-8917:


Sorry for this late reply. I will check it shortly.

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be equal to multiply capacity by parent-queue's 
> absolute-capacity,
> but currently it's calculated as dividing capacity by parent-queue's 
> absolute-capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914048#comment-16914048
 ] 

Tao Yang commented on YARN-8995:


Thanks [~cheersyang] for the review.
I agree that 'in-thousands' should be removed from the configuration name. We 
should also make this log independent so that it is not skipped forever when the 
threshold is configured with a value that isn't divisible by 1000; for example, 
move {{logEventDistribution()}} out of the if-block that guards the queue-size 
log and add a new if-block for it.
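
A standalone sketch of that restructure with illustrative field and method names (not 
the actual AsyncDispatcher code or the patch itself): the event-type distribution gets 
its own threshold check instead of living inside the every-1000 queue-size log, so a 
threshold that is not a multiple of 1000 still triggers it.
{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only; names and the threshold value are assumptions.
class DispatcherLogSketch {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private int lastEventQueueSizeLogged = 0;
  private final int printEventsInfoThreshold = 2500;   // any value works, not only multiples of 1000

  void onEventEnqueued() {
    int qSize = eventQueue.size();
    if (qSize != 0 && qSize % 1000 == 0 && qSize != lastEventQueueSizeLogged) {
      lastEventQueueSizeLogged = qSize;
      System.out.println("Size of event-queue is " + qSize);
    }
    // Independent if-block: no longer skipped when the threshold is not divisible by 1000.
    if (qSize >= printEventsInfoThreshold) {
      logEventDistribution();
    }
  }

  private void logEventDistribution() {
    // Hypothetical helper: would log how many pending events of each type are queued.
    System.out.println("Pending event distribution: ...");
  }

  public static void main(String[] args) {
    DispatcherLogSketch dispatcher = new DispatcherLogSketch();
    for (int i = 0; i < 3000; i++) {
      dispatcher.eventQueue.add(new Object());
    }
    dispatcher.onEventEnqueued();   // logs the queue size (3000) and the event distribution
  }
}
{noformat}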

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster,there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of  
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, and add the information 
> to the metrics, and the threshold of queue size is a parametor which can be 
> changed.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-19 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910932#comment-16910932
 ] 

Tao Yang commented on YARN-8995:


Thanks [~zhuqi] for this contribution; you are welcome to contribute more in 
our community.
I would like to hear thoughts from others before committing this. 
[~cheersyang], could you please help review this patch and give some advice?

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster,there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of  
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event type of the too big event queue size, and add the information 
> to the metrics, and the threshold of queue size is a parametor which can be 
> changed.






[jira] [Commented] (YARN-9716) AM container might leak

2019-08-08 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903526#comment-16903526
 ] 

Tao Yang commented on YARN-9716:


Hi, [~vinodkv], could you please take a look at this issue? 

> AM container might leak
> ---
>
> Key: YARN-9716
> URL: https://issues.apache.org/jira/browse/YARN-9716
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> There is a risk that AM container might leak when NM exits unexpected 
> meanwhile AM container is localizing if AM expiry interval (conf-key: 
> yarn.am.liveness-monitor.expiry-interval-ms) is less than NM expiry interval 
> (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
>  RMAppAttempt state changes as follows:
> {noformat}
> LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition) 
>  --> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition / 
> ExpiredTransition: send AMLauncherEventType.CLEANUP )  --> FAILED
> {noformat}
> AMLauncherEventType.CLEANUP will be handled by AMLauncher#cleanup which 
> internally call ContainerManagementProtocol#stopContainer to stop AM 
> container via communicating with NM, if NM can't be connected, it just skip 
> it without any logs.
> I think in this case we can complete the AM container in scheduler when 
> failed to stop it, so that it will have a chance to be stopped when NM 
> reconnects with RM. 
>  Hope to hear your thoughts? Thank you!






[jira] [Commented] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-08-08 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903501#comment-16903501
 ] 

Tao Yang commented on YARN-9685:


Thanks [~eepayne] for the review and commit!

> NPE when rendering the info table of leaf queue in non-accessible partitions
> 
>
> Key: YARN-9685
> URL: https://issues.apache.org/jira/browse/YARN-9685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9685.001.patch
>
>
> I found incomplete queue info shown on scheduler page and NPE in RM log when 
> rendering the info table of leaf queue in non-accessible partitions.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> {noformat}
> The direct cause is that PartitionQueueCapacitiesInfo of leaf queues in 
> non-accessible partitions are incomplete(part of fields are null such as 
> configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource)
>  but some places in CapacitySchedulerPage don't consider that.






[jira] [Created] (YARN-9716) AM container might leak

2019-07-31 Thread Tao Yang (JIRA)
Tao Yang created YARN-9716:
--

 Summary: AM container might leak
 Key: YARN-9716
 URL: https://issues.apache.org/jira/browse/YARN-9716
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.3.0
Reporter: Tao Yang
Assignee: Tao Yang


There is a risk that the AM container might leak when the NM exits unexpectedly while 
the AM container is localizing, if the AM expiry interval (conf-key: 
yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM expiry interval 
(conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
 RMAppAttempt state changes as follows:
{noformat}
LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition) 
 --> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition / 
ExpiredTransition: send AMLauncherEventType.CLEANUP )  --> FAILED
{noformat}
AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which internally 
calls ContainerManagementProtocol#stopContainer to stop the AM container by 
communicating with the NM; if the NM can't be connected, it just skips the container 
without logging anything.

I think in this case we can complete the AM container in the scheduler when we fail 
to stop it, so that it still has a chance to be stopped when the NM reconnects with 
the RM.
 Hope to hear your thoughts. Thank you!
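
A standalone sketch of the proposal with stub interfaces; this is not the actual 
AMLauncher or scheduler code, and the exact event used to complete the container on 
the RM side is deliberately left abstract here.
{noformat}
// Stub-based sketch only; in the real RM the "complete" step would be a scheduler/RMNode event.
class AmCleanupSketch {
  interface ContainerManagement { void stopContainer(String containerId) throws Exception; }
  interface Scheduler { void completeContainer(String containerId, String reason); }

  static void cleanup(String containerId, ContainerManagement nmProxy, Scheduler scheduler) {
    try {
      nmProxy.stopContainer(containerId);
    } catch (Exception e) {
      // NM unreachable: log the failure and complete the container on the RM side
      // so it cannot leak until NM expiry.
      System.out.println("Failed to stop " + containerId + ": " + e.getMessage());
      scheduler.completeContainer(containerId, "AM container stop failed, completed by RM");
    }
  }

  public static void main(String[] args) {
    cleanup("container_e01_000001",
        id -> { throw new Exception("NM not reachable"); },                   // simulate a dead NM
        (id, why) -> System.out.println("Completed " + id + " in scheduler: " + why));
  }
}
{noformat}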






[jira] [Comment Edited] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896704#comment-16896704
 ] 

Tao Yang edited comment on YARN-9714 at 7/31/19 2:29 AM:
-

Hi, [~bibinchundatt].
{quote}IIUC the zookeer StateStore is not an active service and zookeeper 
connection is common for leader election too.
 Do we really need to close the connection ??
{quote}
RMStateStore is an active service which will be created for every 
RMActiveServices instance. As for zkManager in ZKStateStore, it will reuse 
zkManager for HA when RM uses the Curator-based elector for leader election, 
otherwise it will be created for ZKRMStateStore, so that we should only close 
it when it's not for HA in ZKRMStateStore#serviceStop. Make sense?
{quote}
Few other issues in 3.1.1 which got fixed recently are YARN-9644,9639
{quote}
Thanks, I noticed those issues before but just missed YARN-9642 :(
 


was (Author: tao yang):
Hi, [~bibinchundatt].
{quote}
IIUC the zookeer StateStore is not an active service and zookeeper connection 
is common for leader election too.
Do we really need to close the connection ??
{quote}
RMStateStore is an active service which will be created for every 
RMActiveServices instance. As for zkManager in ZKStateStore, it will reuse 
zkManager for HA when RM uses the Curator-based elector for leader election, 
otherwise it will be created for ZKRMStateStore, so that we should only close 
it when it's not for HA in ZKRMStateStore#serviceStop. Make sense?

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Comment Edited] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896704#comment-16896704
 ] 

Tao Yang edited comment on YARN-9714 at 7/31/19 2:21 AM:
-

Hi, [~bibinchundatt].
{quote}
IIUC the zookeer StateStore is not an active service and zookeeper connection 
is common for leader election too.
Do we really need to close the connection ??
{quote}
RMStateStore is an active service which will be created for every 
RMActiveServices instance. As for zkManager in ZKStateStore, it will reuse 
zkManager for HA when RM uses the Curator-based elector for leader election, 
otherwise it will be created for ZKRMStateStore, so that we should only close 
it when it's not for HA in ZKRMStateStore#serviceStop. Make sense?


was (Author: tao yang):
Hi, [~bibinchundatt].
{quote}
IIUC the zookeer StateStore is not an active service and zookeeper connection 
is common for leader election too.
Do we really need to close the connection ??
{qoute}
RMStateStore is an active service which will be created for every 
RMActiveServices instance. As for zkManager in ZKStateStore, it will reuse 
zkManager for HA when RM uses the Curator-based elector for leader election, 
otherwise it will be created for ZKRMStateStore, so that we should only close 
it when it's not for HA in ZKRMStateStore#serviceStop. Make sense?

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896704#comment-16896704
 ] 

Tao Yang commented on YARN-9714:


Hi, [~bibinchundatt].
{quote}
IIUC the zookeer StateStore is not an active service and zookeeper connection 
is common for leader election too.
Do we really need to close the connection ??
{qoute}
RMStateStore is an active service which will be created for every 
RMActiveServices instance. As for zkManager in ZKStateStore, it will reuse 
zkManager for HA when RM uses the Curator-based elector for leader election, 
otherwise it will be created for ZKRMStateStore, so that we should only close 
it when it's not for HA in ZKRMStateStore#serviceStop. Make sense?

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently RM full GC happened in one of our clusters, after investigating the 
> dump memory and jstack, I found two places in RM may cause memory leaks after 
> RM transitioned to standby:
>  # Release cache cleanup timer in AbstractYarnScheduler never be canceled.
>  # ZooKeeper connection in ZKRMStateStore never be closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.






[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896074#comment-16896074
 ] 

Tao Yang commented on YARN-9642:


Thanks [~bibinchundatt] for this fix.
+1 for the patch.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in case of a fast 
> switch-over too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!
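
As a rough illustration of the fix direction described above (invented class and field names, not the actual AbstractYarnScheduler code), cancelling the timer in serviceStop prevents the scheduled task from keeping the stopped service reachable:
{code:java}
// Cancel the scheduled Timer on stop so the TimerTask cannot keep the old
// service instance alive after the RM transitions to standby.
import java.util.Timer;
import java.util.TimerTask;

public class SchedulerLikeService {

  private Timer releaseCacheTimer;

  void serviceStart() {
    releaseCacheTimer = new Timer("release-cache-cleanup", true /* daemon */);
    releaseCacheTimer.scheduleAtFixedRate(new TimerTask() {
      @Override
      public void run() {
        // The anonymous task captures SchedulerLikeService.this,
        // keeping the whole service object reachable while scheduled.
        System.out.println("cleaning pending container cache");
      }
    }, 0L, 1000L);
  }

  void serviceStop() {
    if (releaseCacheTimer != null) {
      releaseCacheTimer.cancel();   // discards scheduled tasks so the reference can be collected
      releaseCacheTimer = null;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    SchedulerLikeService service = new SchedulerLikeService();
    service.serviceStart();
    Thread.sleep(1500L);
    service.serviceStop();          // without this, the timer task would outlive the service
  }
}
{code}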



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Summary: ZooKeeper connection in ZKRMStateStore leaks after RM transitioned 
to standby  (was: Memory leaks after RM transitioned to standby)

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently a full GC happened in the RM of one of our clusters. After investigating 
> the memory dump and jstack output, I found two places in RM that may cause memory 
> leaks after RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer, 
> respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Attachment: YARN-9714.002.patch

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch
>
>
> Recently a full GC happened in the RM of one of our clusters. After investigating 
> the memory dump and jstack output, I found two places in RM that may cause memory 
> leaks after RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer, 
> respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896034#comment-16896034
 ] 

Tao Yang commented on YARN-9714:


Thanks [~bibinchundatt] for the reminder. Yes, it should be moved to serviceStart.
I will attach a v2 patch to handle the remaining issue. Thanks.

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently a full GC happened in the RM of one of our clusters. After investigating 
> the memory dump and jstack output, I found two places in RM that may cause memory 
> leaks after RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer, 
> respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896007#comment-16896007
 ] 

Tao Yang commented on YARN-9714:


Attached v1 patch for review. cc: [~cheersyang], [~sunil.g], [~leftnoteasy]

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently a full GC happened in the RM of one of our clusters. After investigating 
> the memory dump and jstack output, I found two places in RM that may cause memory 
> leaks after RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer, 
> respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9714:
---
Attachment: YARN-9714.001.patch

> Memory leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Blocker
> Attachments: YARN-9714.001.patch
>
>
> Recently a full GC happened in the RM of one of our clusters. After investigating 
> the memory dump and jstack output, I found two places in RM that may cause memory 
> leaks after RM transitions to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To fix these leaks, we should close the connection and cancel the timer, 
> respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9714) Memory leaks after RM transitioned to standby

2019-07-30 Thread Tao Yang (JIRA)
Tao Yang created YARN-9714:
--

 Summary: Memory leaks after RM transitioned to standby
 Key: YARN-9714
 URL: https://issues.apache.org/jira/browse/YARN-9714
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang


Recently a full GC happened in the RM of one of our clusters. After investigating 
the memory dump and jstack output, I found two places in RM that may cause memory 
leaks after RM transitions to standby:
 # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
 # The ZooKeeper connection in ZKRMStateStore is never closed.

To fix these leaks, we should close the connection and cancel the timer, 
respectively, when the services are stopping.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-23 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890952#comment-16890952
 ] 

Tao Yang commented on YARN-9687:


Thanks [~cheersyang] and [~sunilg] for the review and comments.
The UT failure in the rm module seems unrelated to this patch.

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> allocations will be let through in the scheduling process but will always be 
> rejected when committing these proposals.
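
For clarity, here is a small, dependency-free Java rendition of the example above. It only mirrors the numbers in the description and does not use the real Resources or DominantResourceCalculator classes; it shows how the ratio-based comparison passes even though the request does not fit into the headroom component-wise.
{code:java}
// Worked example of the headroom-check mismatch described above.
public class HeadroomCheckExample {

  /** Dominant share of a resource against the cluster total. */
  static double dominantShare(long mb, long vcores, long clusterMb, long clusterVcores) {
    return Math.max((double) mb / clusterMb, (double) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    long clusterMb = 10 * 1024, clusterVcores = 10;
    long headroomMb = 2 * 1024, headroomVcores = 4;
    long requiredMb = 3 * 1024, requiredVcores = 1;

    double headroomShare = dominantShare(headroomMb, headroomVcores, clusterMb, clusterVcores); // 0.4
    double requiredShare = dominantShare(requiredMb, requiredVcores, clusterMb, clusterVcores); // 0.3

    // Ratio-based comparison (dominant shares): passes, 0.4 >= 0.3.
    System.out.println("ratio check passes: " + (headroomShare >= requiredShare));

    // Component-wise check (what matters when committing): fails on memory, 3GB > 2GB.
    boolean fits = requiredMb <= headroomMb && requiredVcores <= headroomVcores;
    System.out.println("request really fits in headroom: " + fits);
  }
}
{code}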



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-19 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1631#comment-1631
 ] 

Tao Yang edited comment on YARN-9687 at 7/19/19 12:48 PM:
--

Attached v1 patch with UT for review. 
[~cheersyang], could you please help to review these issues: YARN-9685, 
YARN-9686 and YARN-9687? Thanks.


was (Author: tao yang):
Attached v1 patch with UT for review. 
[~cheersyang], could you please help to review these issue: YARN-9685, 
YARN-9686 and YARN-9687? Thanks.

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> allocations will be let through in the scheduling process but will always be 
> rejected when committing these proposals.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-19 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1631#comment-1631
 ] 

Tao Yang commented on YARN-9687:


Attached v1 patch with UT for review. 
[~cheersyang], could you please help to review these issue: YARN-9685, 
YARN-9686 and YARN-9687? Thanks.

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> allocations will be let through in the scheduling process but will always be 
> rejected when committing these proposals.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9687:
---
Attachment: YARN-9687.001.patch

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio; when using DominantResourceCalculator, it may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcore>
> In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> allocations will be let through in the scheduling process but will always be 
> rejected when committing these proposals.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-19 Thread Tao Yang (JIRA)
Tao Yang created YARN-9687:
--

 Summary: Queue headroom check may let unacceptable allocation off 
when using DominantResourceCalculator
 Key: YARN-9687
 URL: https://issues.apache.org/jira/browse/YARN-9687
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Tao Yang
Assignee: Tao Yang


Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
uses {{Resources#greaterThanOrEqual}}, which internally compares resources by 
ratio; when using DominantResourceCalculator, it may let unacceptable 
allocations through in some scenarios.
For example:
cluster-resource=<10GB, 10 vcores>
queue-headroom=<2GB, 4 vcores>
required-resource=<3GB, 1 vcore>
In this way, the headroom ratio (0.4) is greater than the required ratio (0.3), so 
allocations will be let through in the scheduling process but will always be 
rejected when committing these proposals.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9686) Reduce visibility of blacklisted nodes information (only for current app attempt) to avoid the abuse of memory

2019-07-19 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888758#comment-16888758
 ] 

Tao Yang commented on YARN-9686:


Attached v1 patch for review.

> Reduce visibility of blacklisted nodes information (only for current app 
> attempt) to avoid the abuse of memory
> --
>
> Key: YARN-9686
> URL: https://issues.apache.org/jira/browse/YARN-9686
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9686.001.patch
>
>
> Recently we found an issue where the RM did a long GC, and we found many WARN 
> logs (Ignoring Blacklists, blacklist size 1775 is more than failure threshold 
> ratio 0.2000298023224 out of total usable nodes 1778) in the RM log at a 
> very high frequency, about 30,000+ per second.
> The direct cause is that a few apps with a large number of attempts and many 
> blacklisted nodes were requested frequently via the REST API or web UI. For every 
> single request, RM has to allocate new memory for the blacklisted nodes many 
> times (N * NUM_ATTEMPTS).
> Currently both the AM (system) blacklisted nodes and the app blacklisted nodes are 
> transferred among app attempts and there is only one instance of each, so it is 
> redundant and costly to traverse all blacklisted nodes for every app attempt. 
> Therefore I propose to get and show blacklisted nodes only for the current app 
> attempt, to improve performance and avoid the abuse of memory in similar 
> scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9686) Reduce visibility of blacklisted nodes information (only for current app attempt) to avoid the abuse of memory

2019-07-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9686:
---
Attachment: YARN-9686.001.patch

> Reduce visibility of blacklisted nodes information (only for current app 
> attempt) to avoid the abuse of memory
> --
>
> Key: YARN-9686
> URL: https://issues.apache.org/jira/browse/YARN-9686
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9686.001.patch
>
>
> Recently we found an issue where the RM did a long GC, and we found many WARN 
> logs (Ignoring Blacklists, blacklist size 1775 is more than failure threshold 
> ratio 0.2000298023224 out of total usable nodes 1778) in the RM log at a 
> very high frequency, about 30,000+ per second.
> The direct cause is that a few apps with a large number of attempts and many 
> blacklisted nodes were requested frequently via the REST API or web UI. For every 
> single request, RM has to allocate new memory for the blacklisted nodes many 
> times (N * NUM_ATTEMPTS).
> Currently both the AM (system) blacklisted nodes and the app blacklisted nodes are 
> transferred among app attempts and there is only one instance of each, so it is 
> redundant and costly to traverse all blacklisted nodes for every app attempt. 
> Therefore I propose to get and show blacklisted nodes only for the current app 
> attempt, to improve performance and avoid the abuse of memory in similar 
> scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9686) Reduce visibility of blacklisted nodes information (only for current app attempt) to avoid the abuse of memory

2019-07-19 Thread Tao Yang (JIRA)
Tao Yang created YARN-9686:
--

 Summary: Reduce visibility of blacklisted nodes information (only 
for current app attempt) to avoid the abuse of memory
 Key: YARN-9686
 URL: https://issues.apache.org/jira/browse/YARN-9686
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang


Recently we found an issue where the RM did a long GC, and we found many WARN 
logs (Ignoring Blacklists, blacklist size 1775 is more than failure threshold 
ratio 0.2000298023224 out of total usable nodes 1778) in the RM log at a 
very high frequency, about 30,000+ per second.
The direct cause is that a few apps with a large number of attempts and many 
blacklisted nodes were requested frequently via the REST API or web UI. For every 
single request, RM has to allocate new memory for the blacklisted nodes many 
times (N * NUM_ATTEMPTS).

Currently both the AM (system) blacklisted nodes and the app blacklisted nodes are 
transferred among app attempts and there is only one instance of each, so it is 
redundant and costly to traverse all blacklisted nodes for every app attempt. 
Therefore I propose to get and show blacklisted nodes only for the current app 
attempt, to improve performance and avoid the abuse of memory in similar scenarios.
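
As a back-of-the-envelope sketch of the cost described above (plain Java with made-up sizes; not the actual RM report code), copying the shared blacklist once per attempt on every report request multiplies the allocations, while reporting only the current attempt copies it once:
{code:java}
// Illustrates why reporting blacklists for all attempts is costly when the
// blacklist is a single shared instance across attempts.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BlacklistReportCost {
  public static void main(String[] args) {
    Set<String> sharedBlacklist = new HashSet<>();
    for (int i = 0; i < 1775; i++) {
      sharedBlacklist.add("node-" + i);
    }
    int numAttempts = 50;

    // Current behaviour: one defensive copy of the blacklist per attempt, per request.
    List<List<String>> perAttemptCopies = new ArrayList<>();
    for (int attempt = 0; attempt < numAttempts; attempt++) {
      perAttemptCopies.add(new ArrayList<>(sharedBlacklist));
    }
    System.out.println("entries copied for all attempts: "
        + perAttemptCopies.size() * sharedBlacklist.size());   // 50 * 1775

    // Proposed behaviour: copy the blacklist only for the current attempt.
    List<String> currentAttemptOnly = new ArrayList<>(sharedBlacklist);
    System.out.println("entries copied for current attempt: " + currentAttemptOnly.size());
  }
}
{code}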



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-07-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9685:
---
Affects Version/s: 3.3.0

> NPE when rendering the info table of leaf queue in non-accessible partitions
> 
>
> Key: YARN-9685
> URL: https://issues.apache.org/jira/browse/YARN-9685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9685.001.patch
>
>
> I found incomplete queue info shown on scheduler page and NPE in RM log when 
> rendering the info table of leaf queue in non-accessible partitions.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> {noformat}
> The direct cause is that the PartitionQueueCapacitiesInfo of leaf queues in 
> non-accessible partitions is incomplete (some of its fields are null, such as 
> configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource),
>  but some places in CapacitySchedulerPage don't account for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-07-19 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888629#comment-16888629
 ] 

Tao Yang commented on YARN-9685:


Attached v1 patch to fix the related places so that they handle this scenario.
Moreover, the widths of the percentage bars for capacity and used-capacity are 
corrected when the absolute max-capacity is zero, to avoid the division by zero 
that sets the width to POSITIVE_INFINITY instead of 0.
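
A tiny illustration of the width fix mentioned above (assumed method shape, not the actual CapacitySchedulerPage code): in Java float arithmetic, dividing by a zero absolute max-capacity yields POSITIVE_INFINITY, so the guard returns 0 instead.
{code:java}
// Guard the bar-width computation against a zero absolute max-capacity.
public class BarWidthExample {

  static float barWidth(float usedCapacity, float absoluteMaxCapacity) {
    if (absoluteMaxCapacity <= 0f) {
      return 0f;                               // previously: usedCapacity / 0f -> Infinity
    }
    return Math.min(usedCapacity / absoluteMaxCapacity, 1f);
  }

  public static void main(String[] args) {
    System.out.println(barWidth(0.3f, 0.0f));  // 0.0 instead of Infinity
    System.out.println(barWidth(0.3f, 0.6f));  // 0.5
  }
}
{code}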

> NPE when rendering the info table of leaf queue in non-accessible partitions
> 
>
> Key: YARN-9685
> URL: https://issues.apache.org/jira/browse/YARN-9685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9685.001.patch
>
>
> I found incomplete queue info shown on scheduler page and NPE in RM log when 
> rendering the info table of leaf queue in non-accessible partitions.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> {noformat}
> The direct cause is that the PartitionQueueCapacitiesInfo of leaf queues in 
> non-accessible partitions is incomplete (some of its fields are null, such as 
> configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource),
>  but some places in CapacitySchedulerPage don't account for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-07-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9685:
---
Attachment: YARN-9685.001.patch

> NPE when rendering the info table of leaf queue in non-accessible partitions
> 
>
> Key: YARN-9685
> URL: https://issues.apache.org/jira/browse/YARN-9685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9685.001.patch
>
>
> I found incomplete queue info shown on scheduler page and NPE in RM log when 
> rendering the info table of leaf queue in non-accessible partitions.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> {noformat}
> The direct cause is that the PartitionQueueCapacitiesInfo of leaf queues in 
> non-accessible partitions is incomplete (some of its fields are null, such as 
> configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource),
>  but some places in CapacitySchedulerPage don't account for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-07-19 Thread Tao Yang (JIRA)
Tao Yang created YARN-9685:
--

 Summary: NPE when rendering the info table of leaf queue in 
non-accessible partitions
 Key: YARN-9685
 URL: https://issues.apache.org/jira/browse/YARN-9685
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Tao Yang
Assignee: Tao Yang


I found incomplete queue info shown on the scheduler page and an NPE in the RM log 
when rendering the info table of a leaf queue in non-accessible partitions.
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
{noformat}

The direct cause is that the PartitionQueueCapacitiesInfo of leaf queues in 
non-accessible partitions is incomplete (some of its fields are null, such as 
configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource),
 but some places in CapacitySchedulerPage don't account for that.
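
The following is a minimal sketch of the null-safe guard the description implies (hypothetical field and method names; the real DAO and page classes are more involved): capacities of a leaf queue in a non-accessible partition can carry null resource fields, so the rendering code should fall back to a placeholder instead of dereferencing them.
{code:java}
// Render queue capacity info without NPEs when the DAO is only partially populated.
public class QueueCapacityRendering {

  static final class PartitionQueueCapacitiesInfo {
    final Long configuredMinResourceMb;   // null for non-accessible partitions
    PartitionQueueCapacitiesInfo(Long configuredMinResourceMb) {
      this.configuredMinResourceMb = configuredMinResourceMb;
    }
  }

  static String renderMinResource(PartitionQueueCapacitiesInfo info) {
    // Guard against the partially populated DAO instead of dereferencing null.
    return info.configuredMinResourceMb == null
        ? "N/A" : info.configuredMinResourceMb + " MB";
  }

  public static void main(String[] args) {
    System.out.println(renderMinResource(new PartitionQueueCapacitiesInfo(4096L)));
    System.out.println(renderMinResource(new PartitionQueueCapacitiesInfo(null))); // no NPE
  }
}
{code}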



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-07-18 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888451#comment-16888451
 ] 

Tao Yang commented on YARN-8995:


Hi, [~zhuqi], sorry for my late reply. 

LGTM. +1 for the patch. I will commit this if there are no further comments from others.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to back up and hurt the performance of the cluster, such as the bug in  
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, to add this 
> information to the metrics, and to make the queue-size threshold a 
> configurable parameter.
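
As a rough, dependency-free sketch of the idea (made-up class and event names; not the actual AsyncDispatcher code): once the queue size passes a configurable threshold, log a per-type breakdown of the queued events so the dominating event type can be spotted and exported as a metric.
{code:java}
// Log a breakdown of queued event types when the queue grows past a threshold.
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

public class QueueSizeMonitor {

  enum EventType { NODE_UPDATE, APP_ATTEMPT_ADDED, CONTAINER_FINISHED }

  private final BlockingQueue<EventType> eventQueue = new LinkedBlockingQueue<>();
  private final int threshold;   // configurable; in the real patch this would come from configuration

  QueueSizeMonitor(int threshold) {
    this.threshold = threshold;
  }

  void dispatch(EventType event) {
    eventQueue.add(event);
    if (eventQueue.size() > threshold) {
      Map<EventType, Long> byType = eventQueue.stream()
          .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
      // In the real patch this would go to the dispatcher log and a metrics gauge.
      System.out.println("Event queue size " + eventQueue.size()
          + " exceeds threshold " + threshold + ", breakdown: " + byType);
    }
  }

  public static void main(String[] args) {
    QueueSizeMonitor monitor = new QueueSizeMonitor(3);
    for (int i = 0; i < 5; i++) {
      monitor.dispatch(EventType.NODE_UPDATE);
    }
  }
}
{code}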



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9490) applicationresourceusagereport return wrong number of reserved containers

2019-07-18 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887683#comment-16887683
 ] 

Tao Yang commented on YARN-9490:


Hi, [~zyb]. Any updates about this patch? 

> applicationresourceusagereport return wrong number of reserved containers
> -
>
> Key: YARN-9490
> URL: https://issues.apache.org/jira/browse/YARN-9490
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: yanbing zhang
>Assignee: yanbing zhang
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9490.002.patch, YARN-9490.patch, 
> YARN-9490.patch1.patch
>
>
> When getting an ApplicationResourceUsageReport instance from the 
> SchedulerApplicationAttempt class, I found the constructor input 
> parameter (reservedContainers.size()) is wrong, because this variable is a 
> map keyed by SchedulerRequestKey whose values are maps of reserved containers, so 
> "reservedContainers.size()" is not the number of containers but the number of 
> SchedulerRequestKeys.
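
To illustrate the counting mismatch, here is a toy example with Strings standing in for SchedulerRequestKey, NodeId and RMContainer (not the real SchedulerApplicationAttempt code): the outer map's size() counts request keys, while the number of reserved containers is the sum of the inner maps' sizes.
{code:java}
// Counting reserved containers from a nested map of reservations.
import java.util.HashMap;
import java.util.Map;

public class ReservedContainerCount {
  public static void main(String[] args) {
    // SchedulerRequestKey -> (NodeId -> RMContainer), modelled here with Strings.
    Map<String, Map<String, String>> reservedContainers = new HashMap<>();
    reservedContainers.put("key-1", Map.of("node-1", "container-1", "node-2", "container-2"));
    reservedContainers.put("key-2", Map.of("node-3", "container-3"));

    int wrong = reservedContainers.size();                      // 2: number of request keys
    int right = reservedContainers.values().stream()
        .mapToInt(Map::size).sum();                             // 3: number of containers

    System.out.println("reservedContainers.size() = " + wrong);
    System.out.println("actual reserved containers = " + right);
  }
}
{code}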



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9635) Nodes page displayed duplicate nodes

2019-07-16 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886643#comment-16886643
 ] 

Tao Yang commented on YARN-9635:


Thanks [~jiwq] for the patch and sorry for my late reply. 

Another related place is the description of that configuration in NodeManager.md; 
it should be updated as well.

> Nodes page displayed duplicate nodes
> 
>
> Key: YARN-9635
> URL: https://issues.apache.org/jira/browse/YARN-9635
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: UI2-nodes.jpg, YARN-9635.001.patch
>
>
> Steps:
>  * shutdown nodes
>  * start nodes
> Nodes Page:
> !UI2-nodes.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-07-05 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: YARN-9567.002.patch

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are outstanding if we can show the 
> diagnostics of this app alongside them.
> As discussed with [~cheersyang], we can passively load diagnostics from the cache 
> of completed app activities instead of actively triggering them, which may bring 
> uncontrollable risks.
> For example:
> (1) At first, no diagnostics are shown below the outstanding requests if app 
> activities have not been triggered and cached yet.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-05 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879157#comment-16879157
 ] 

Tao Yang commented on YARN-9664:


Hi, [~cheersyang], I have summarized all diagnostics in the activities 
document: 
[https://docs.google.com/document/d/1NIIDCWOLUqlhrclzr91YPYOrvmLfjlKbyCWrnCMci0g/edit#heading=h.8dw59zh7e137].
 It may be helpful for the review.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-07-05 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879155#comment-16879155
 ] 

Tao Yang commented on YARN-9538:


Hi, [~cheersyang]. I have updated the document below the V3 separator line based on 
the latest updates in YARN-9664. Please help review it in your free time, thanks.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-05 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879130#comment-16879130
 ] 

Tao Yang commented on YARN-9664:


Attached v2 patch with additional updates:
 * Add activities DAO classes into JAXBContextResolver to correct the output 
format.
 * Refactor field names: activityState -> allocationState (reverting the modification 
in the v1 patch), timeStamp -> timestamp
 * Update the types of the priority/allocationRequestId fields to numeric (Integer/Long)
 * Support comma-separated format for parameters (type is Set) of app 
activities REST API.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-05 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: YARN-9664.002.patch

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-03 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: (was: YARN-9664.001.patch)

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-03 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: YARN-9664.001.patch

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.001.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-03 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: YARN-9664.001.patch

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-03 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9664:
---
Attachment: (was: YARN-9664.001.patch)

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9658) Fix UT failures in TestLeafQueue

2019-07-02 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877495#comment-16877495
 ] 

Tao Yang commented on YARN-9658:


Thanks [~cheersyang] for the review and commit.

> Fix UT failures in TestLeafQueue
> 
>
> Key: YARN-9658
> URL: https://issues.apache.org/jira/browse/YARN-9658
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9658.001.patch
>
>
> In ActivitiesManager, if there's no YARN configuration in the mocked RMContext, 
> the cleanup interval can't be initialized to its default of 5 seconds, so the 
> cleanup thread keeps running repeatedly without any interval. This may cause 
> problems for the mockito framework; in this case it caused an OOM, since many 
> throwable objects were generated internally by the incomplete mocks.
> Add a configuration to the mocked RMContext to fix the failures in TestLeafQueue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-07-02 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877466#comment-16877466
 ] 

Tao Yang commented on YARN-9664:


Attached v1 patch for review.
Updates:
* Adjust structure and diagnostics for the response of activities
** Refactor activity diagnostics to make them more sensible: begin with 
activity level and adjust some contents
** Refactor field names: allocationState -> activityState, requestAllocation -> 
requestAllocations, allocationAttempt -> allocationAttempts
** Adjust the sequence of some fields
** Correct unreasonable activity states
** Add activity diagnostics for initial check at the beginning of scheduling 
process
* Support recording and showing partition name in scheduler activities
* Add activity level including QUEUE/APP/REQUEST/NODE and improve the recording 
process to get better classifications
* UT
** Add new test cases (testQueueSkippedBecauseOfHeadroom and 
testNodeSkippedBecauseOfRelaxLocality) in TestRMWebServicesSchedulerActivities 
to test diagnostics at request/node level
** Add testPartitionInSchedulerActivities in 
TestRMWebServicesForCSWithPartitions to test partition information
** Update frequently-used strings to be constants

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch
>
>
> Currently some diagnostics are not easy for common users to understand, and I 
> found some places that still need to be improved, such as missing partition 
> information and a lack of necessary activities. This issue is to address 
> these shortcomings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


