[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components like 
ZooKeeper and HDFS. 

Currently we may have to search for suspicious traces among many related metrics 
and tremendous logs when encountering an unclear issue, hoping to locate the 
root cause of the problem. For example, some applications keep staying in 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jammed 
event dispatcher, while the useful traces are buried in many metrics and logs. 
It's not easy to figure out what happened, even for experts, let alone common 
users.

So I propose to add a common health check mechanism to improve troubleshooting 
skills for RM. My general thought is that we can (a rough sketch follows this 
list):
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
workState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, supporting checking and fetching health reports both periodically and 
manually (e.g. triggered via a REST API), as well as publishing metrics and logs.
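
To make the proposal more concrete, here is a rough, illustrative sketch of the 
pieces described above. It is not part of any patch: apart from HealthReporter 
and getHealthReport(), every class, field and threshold below is an assumption 
made purely for illustration.
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Queue;

// Hypothetical value class; field names follow the description above.
class HealthReport {
  enum WorkState { NORMAL, IDLE, BUSY }

  private final boolean healthy;
  private final WorkState workState;
  private final long updateTime;       // when this report was generated (ms)
  private final String diagnostics;    // human-readable troubleshooting hints
  private final Map<String, Long> keyMetrics;

  HealthReport(boolean healthy, WorkState workState, long updateTime,
      String diagnostics, Map<String, Long> keyMetrics) {
    this.healthy = healthy;
    this.workState = workState;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    this.keyMetrics = keyMetrics;
  }

  boolean isHealthy() { return healthy; }
  WorkState getWorkState() { return workState; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Long> getKeyMetrics() { return keyMetrics; }
}

// Hypothetical example of a key service evaluating its internal state: an event
// dispatcher that reports BUSY when its pending queue exceeds a made-up threshold.
class DispatcherHealthReporter implements HealthReporter {
  private static final int BUSY_THRESHOLD = 10000;
  private final Queue<Object> eventQueue;

  DispatcherHealthReporter(Queue<Object> eventQueue) {
    this.eventQueue = eventQueue;
  }

  @Override
  public HealthReport getHealthReport() {
    int pending = eventQueue.size();
    boolean busy = pending > BUSY_THRESHOLD;
    return new HealthReport(!busy,
        busy ? HealthReport.WorkState.BUSY : HealthReport.WorkState.NORMAL,
        System.currentTimeMillis(),
        busy ? "event dispatcher is jammed, pending=" + pending : "",
        Collections.singletonMap("pendingEvents", (long) pending));
  }
}
{code}
A HealthCheckerService would then simply iterate over the registered reporters, 
collect their reports on a schedule or on demand, and publish the unhealthy ones 
as metrics and logs.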

  was:
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components like 
ZooKeeper and HDFS. 

Currently we may have to search for suspicious traces among many related metrics 
and tremendous logs when encountering an unclear issue, hoping to locate the 
root cause of the problem. For example, some applications keep staying in 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jammed 
event dispatcher, while the useful traces are buried in many metrics and logs. 
It's not easy to figure out what happened, even for experts, let alone common 
users.

So I propose to add a common health check mechanism to improve troubleshooting 
skills for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, supporting checking and fetching health reports both periodically and 
manually (e.g. triggered via a REST API), as well as publishing metrics and logs.


> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
> state managers, etc., and some of them depend on other basic components like 
> ZooKeeper and HDFS. 
> Currently we may have to search for suspicious traces among many related 
> metrics and tremendous logs when encountering an unclear issue, hoping to 
> locate the root cause of the problem. For example, some applications keep 
> staying in NEW_SAVING state, which can be caused by lost ZooKeeper connections 
> or a jammed event dispatcher, while the useful traces are buried in many 
> metrics and logs. It's not easy to figure out what happened, even for experts, 
> let alone common users.
> So I propose to add a common health check mechanism to improve troubleshooting 
> skills for RM. My general thought is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> workState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
> keyMetrics(Map).
>  * make some key services implement the HealthReporter interface and generate 
> a health report by evaluating their internal state.
>  * add a HealthCheckerService which can manage and monitor all reportable 
> services, supporting checking and fetching health reports both periodically 
> and manually (e.g. triggered via a REST API), as well as publishing metrics 
> and logs.

[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components like 
ZooKeeper and HDFS. 

Currently we may have to search for suspicious traces among many related metrics 
and tremendous logs when encountering an unclear issue, hoping to locate the 
root cause of the problem. For example, some applications keep staying in 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jammed 
event dispatcher, while the useful traces are buried in many metrics and logs. 
It's not easy to figure out what happened, even for experts, let alone common 
users.

So I propose to add a common health check mechanism to improve troubleshooting 
skills for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, supporting checking and fetching health reports both periodically and 
manually (e.g. triggered via a REST API), as well as publishing metrics and logs.

  was:
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components like 
ZooKeeper and HDFS. 

Currently we may have to search for suspicious traces among many related metrics 
and tremendous logs when encountering an unclear issue, hoping to locate the 
root cause of the problem. For example, some applications keep staying in 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jammed 
event dispatcher, while the useful traces are buried in many metrics and logs. 
It's not easy to figure out what happened, even for experts, let alone common 
users.

So I propose to add a common health check mechanism to improve troubleshooting 
skills for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
updateTime(long), diagnostics(string) and keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, supporting checking and fetching health reports both periodically and 
manually (e.g. triggered via a REST API), as well as publishing metrics and logs.


> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
> state managers, etc., and some of them depend on other basic components like 
> ZooKeeper and HDFS. 
> Currently we may have to search for suspicious traces among many related 
> metrics and tremendous logs when encountering an unclear issue, hoping to 
> locate the root cause of the problem. For example, some applications keep 
> staying in NEW_SAVING state, which can be caused by lost ZooKeeper connections 
> or a jammed event dispatcher, while the useful traces are buried in many 
> metrics and logs. It's not easy to figure out what happened, even for experts, 
> let alone common users.
> So I propose to add a common health check mechanism to improve troubleshooting 
> skills for RM. My general thought is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) 
> and keyMetrics(Map).
>  * make some key services implement the HealthReporter interface and generate 
> a health report by evaluating their internal state.
>  * add a HealthCheckerService which can manage and monitor all reportable 
> services, supporting checking and fetching health reports both periodically 
> and manually (e.g. triggered via a REST API), as well as publishing metrics 
> and logs.

[jira] [Commented] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-15 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415388#comment-17415388
 ] 

Tao Yang commented on YARN-10955:
-

Any suggestions and comments are welcome!

cc [~cheersyang], [~leftnoteasy], [~sunil.g], I hope to hear your thoughts on 
this.

Thanks!

> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
> state managers, etc., and some of them depend on other basic components like 
> ZooKeeper and HDFS. 
> Currently we may have to search for suspicious traces among many related 
> metrics and tremendous logs when encountering an unclear issue, hoping to 
> locate the root cause of the problem. For example, some applications keep 
> staying in NEW_SAVING state, which can be caused by lost ZooKeeper connections 
> or a jammed event dispatcher, while the useful traces are buried in many 
> metrics and logs. It's not easy to figure out what happened, even for experts, 
> let alone common users.
> So I propose to add a common health check mechanism to improve troubleshooting 
> skills for RM. My general thought is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> updateTime(long), diagnostics(string) and keyMetrics(Map).
>  * make some key services implement the HealthReporter interface and generate 
> a health report by evaluating their internal state.
>  * add a HealthCheckerService which can manage and monitor all reportable 
> services, supporting checking and fetching health reports both periodically 
> and manually (e.g. triggered via a REST API), as well as publishing metrics 
> and logs.






[jira] [Created] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-15 Thread Tao Yang (Jira)
Tao Yang created YARN-10955:
---

 Summary: Add health check mechanism to improve troubleshooting 
skills for RM
 Key: YARN-10955
 URL: https://issues.apache.org/jira/browse/YARN-10955
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang


RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components like 
ZooKeeper and HDFS. 

Currently we may have to search for suspicious traces among many related metrics 
and tremendous logs when encountering an unclear issue, hoping to locate the 
root cause of the problem. For example, some applications keep staying in 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jammed 
event dispatcher, while the useful traces are buried in many metrics and logs. 
It's not easy to figure out what happened, even for experts, let alone common 
users.

So I propose to add a common health check mechanism to improve troubleshooting 
skills for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
updateTime(long), diagnostics(string) and keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, supporting checking and fetching health reports both periodically and 
manually (e.g. triggered via a REST API), as well as publishing metrics and logs.






[jira] [Commented] (YARN-10909) AbstractCSQueue: Check for methods added for test code but not annotated with VisibleForTesting

2021-09-12 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413883#comment-17413883
 ] 

Tao Yang commented on YARN-10909:
-

Thanks [~snemeth] for the reminder and comments in the PR!  I will pay 
attention to that next time. :) 

> AbstractCSQueue: Check for methods added for test code but not annotated with 
> VisibleForTesting
> ---
>
> Key: YARN-10909
> URL: https://issues.apache.org/jira/browse/YARN-10909
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: jackwangcs
>Priority: Minor
>  Labels: newbie, pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> For example, AbstractCSQueue#setMaxCapacity(float) is only used for testing 
> but is not annotated. There may be other such methods in this class.






[jira] [Resolved] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-09-12 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang resolved YARN-10928.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk. Thanks [~Weihao Zheng] for the contribution! 

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Assignee: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In practice, there are many cases where one user owns many queues in their 
> organization's cluster for different business usages. These queues often share 
> the same properties, such as minimum-user-limit-percent and user-limit-factor. 
> Users have to write one property for every queue they use if they want to 
> customize these shared properties. Adding default queue properties for these 
> cases will simplify the capacity scheduler's configuration file and make it 
> easy to adjust queues' common properties. 
>   
>   CHANGES:
> Add two properties as queues' default values in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}
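
As a reading aid only (not taken from the patch): a hypothetical 
capacity-scheduler.xml fragment showing the intended effect, where the two 
cluster-wide defaults above are set once and the existing per-queue form is only 
needed where a queue should differ. The queue name root.adhoc and the values are 
made up.
{code:xml}
<configuration>
  <!-- Proposed cluster-wide defaults (property names as listed above). -->
  <property>
    <name>yarn.scheduler.capacity.minimum-user-limit-percent</name>
    <value>25</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.user-limit-factor</name>
    <value>2</value>
  </property>

  <!-- Existing per-queue form, now only needed for queues that differ from the default. -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.user-limit-factor</name>
    <value>10</value>
  </property>
</configuration>
{code}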






[jira] [Resolved] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-12 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang resolved YARN-10903.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk already. Thanks [~jackwangcs] for the contribution and 
[~epayne] for the review.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If we use DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true, but in fact we cannot allocate resources to the request due to 
> the max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}
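
As an illustrative reading of the numbers above (not part of the patch, assuming 
resourceCouldBeUnReserved is zero and hadoop-yarn-api/-common on the classpath): 
with a DominantResourceCalculator the check compares dominant shares, and the 
headroom's dominant share is driven by vCores (729/110494), so it exceeds the 
request's dominant share (56320/673966080) even though only 256 MB of memory is 
left, while a component-wise check fails as expected.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DrfHeadroomCheckSketch {
  public static void main(String[] args) {
    // Values taken from the log excerpt above.
    Resource clusterResource = Resource.newInstance(673966080L, 110494);
    Resource headroom = Resource.newInstance(256L, 729);
    Resource required = Resource.newInstance(56320L, 5);

    ResourceCalculator drf = new DominantResourceCalculator();

    // The quoted headroom check: passes because dominant shares are compared.
    System.out.println(Resources.greaterThanOrEqual(
        drf, clusterResource, headroom, required));   // true

    // A per-resource check shows the request does not actually fit the headroom.
    System.out.println(Resources.fitsIn(required, headroom));   // false
  }
}
{code}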






[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412922#comment-17412922
 ] 

Tao Yang commented on YARN-10903:
-

+1 for the PR, will merge it after a few days if there are no objections.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If we use DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true, but in fact we cannot allocate resources to the request due to 
> the max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}






[jira] [Commented] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-09-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412918#comment-17412918
 ] 

Tao Yang commented on YARN-10928:
-

The PR LGTM now, +1 from my side. I will merge this PR after a few days if 
there are no objections.

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Assignee: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In practice, there are many cases where one user owns many queues in their 
> organization's cluster for different business usages. These queues often share 
> the same properties, such as minimum-user-limit-percent and user-limit-factor. 
> Users have to write one property for every queue they use if they want to 
> customize these shared properties. Adding default queue properties for these 
> cases will simplify the capacity scheduler's configuration file and make it 
> easy to adjust queues' common properties. 
>   
>   CHANGES:
> Add two properties as queues' default values in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}






[jira] [Commented] (YARN-10909) AbstractCSQueue: Check for methods added for test code but not annotated with VisibleForTesting

2021-09-08 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412303#comment-17412303
 ] 

Tao Yang commented on YARN-10909:
-

Hi, [~jackwangcs]. The VisibleForTesting annotation should be used for methods 
that are called only from test scope.

For the PR, I can see that AbstractCSQueue#hasChildQueues and 
AbstractCSQueue#getLastSubmittedTimestamp are called in both production and test 
scopes, so the VisibleForTesting annotation does not fit them; please take a look.
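
For clarity, a minimal made-up example of the convention being discussed (the 
class is hypothetical, the method names mirror the ones above, and the 
annotation shown is Guava's; the exact import used in Hadoop may differ):
{code:java}
import com.google.common.annotations.VisibleForTesting;

public class ExampleQueue {
  private float maxCapacity;

  // Called only from test code, so the annotation documents that fact.
  @VisibleForTesting
  void setMaxCapacity(float maxCapacity) {
    this.maxCapacity = maxCapacity;
  }

  // Also called from production code, so @VisibleForTesting would be misleading here.
  public boolean hasChildQueues() {
    return false;  // simplified stub for the sketch
  }
}
{code}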

> AbstractCSQueue: Check for methods added for test code but not annotated with 
> VisibleForTesting
> ---
>
> Key: YARN-10909
> URL: https://issues.apache.org/jira/browse/YARN-10909
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: jackwangcs
>Priority: Minor
>  Labels: newbie, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> For example, AbstractCSQueue#setMaxCapacity(float) is only used for testing 
> but is not annotated. There may be other such methods in this class.






[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-08 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412281#comment-17412281
 ] 

Tao Yang commented on YARN-10903:
-

Thanks [~jackwangcs] for raising this issue; the bug may generate invalid 
proposals that slow down the normal scheduling process. Good catch! 

The PR generally LGTM, just some minor checkstyle warnings need to be fixed; 
please take a look.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If we use DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true, but in fact we cannot allocate resources to the request due to 
> the max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}






[jira] [Commented] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410898#comment-17410898
 ] 

Tao Yang commented on YARN-10928:
-

Hi, [~wwei]. Could you please help to authorize [~Weihao Zheng] as a 
contributor so that he can assign this issue to himself? Thanks!

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In practice, there are many cases where one user owns many queues in their 
> organization's cluster for different business usages. These queues often share 
> the same properties, such as minimum-user-limit-percent and user-limit-factor. 
> Users have to write one property for every queue they use if they want to 
> customize these shared properties. Adding default queue properties for these 
> cases will simplify the capacity scheduler's configuration file and make it 
> easy to adjust queues' common properties. 
>   
>   CHANGES:
> Add two properties as queues' default values in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}






[jira] [Commented] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-08-31 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407775#comment-17407775
 ] 

Tao Yang commented on YARN-10928:
-

Thanks [~Weihao Zheng] for filing this ticket!

I think it's very useful for cluster management, and it's reasonable to use the 
non-queue-specific configuration form, as some cluster-level configurations 
already do (e.g. yarn.scheduler.capacity.maximum-applications / 
yarn.scheduler.capacity.maximum-am-resource-percent / 
yarn.scheduler.capacity.max-parallel-apps).

For the PR, I think it's better to add a simple UT in TestApplicationLimit and 
make sure the related documentation is updated.

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In practice, there are many cases where one user owns many queues in their 
> organization's cluster for different business usages. These queues often share 
> the same properties, such as minimum-user-limit-percent and user-limit-factor. 
> Users have to write one property for every queue they use if they want to 
> customize these shared properties. Adding default queue properties for these 
> cases will simplify the capacity scheduler's configuration file and make it 
> easy to adjust queues' common properties. 
>   
>   CHANGES:
> Add two properties as queues' default values in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}






[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-08-02 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391851#comment-17391851
 ] 

Tao Yang commented on YARN-10854:
-

Thanks [~zhuqi], [~templedf], [~prabhujoseph] and [~kshukla] !

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch, YARN-10854.004.patch, YARN-10854.005.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!
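
As a reading aid only: a hypothetical yarn-site.xml fragment. The timeout 
property below already exists (YARN-4311); the name of the proposed switch is 
invented here purely for illustration, since the actual name would be defined by 
the patch.
{code:xml}
<configuration>
  <!-- Existing: how long a decommissioned/shutdown/lost node stays in RM state
       before it may be treated as untracked and removed. -->
  <property>
    <name>yarn.resourcemanager.node-removal-untracked.timeout-ms</name>
    <value>60000</value>
  </property>

  <!-- Proposed switch (hypothetical name, illustration only): also allow the
       untracked handling when no include path is configured. -->
  <property>
    <name>yarn.resourcemanager.node-removal-untracked.without-include-path.enabled</name>
    <value>true</value>
  </property>
</configuration>
{code}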






[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-08-01 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391256#comment-17391256
 ] 

Tao Yang commented on YARN-10854:
-

Thanks [~zhuqi] for the review. 
Attached v5 patch to replace illegal import class 
'com.google.common.collect.Sets' with 'org.apache.hadoop.util.Sets'.

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch, YARN-10854.004.patch, YARN-10854.005.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-08-01 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Attachment: YARN-10854.005.patch

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch, YARN-10854.004.patch, YARN-10854.005.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390425#comment-17390425
 ] 

Tao Yang commented on YARN-10854:
-

Thanks [~zhuqi] and [~prabhujoseph] for the review and feedback.
Attached v4 patch with improvements to the UT as Qi suggested. Please help to 
review it in your free time, thanks.

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch, YARN-10854.004.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-30 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Attachment: YARN-10854.004.patch

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch, YARN-10854.004.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-28 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Target Version/s: 3.4.0  (was: 3.3.2)

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-28 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389166#comment-17389166
 ] 

Tao Yang commented on YARN-10854:
-

Thanks [~templedf] for the review and feedback.
I would like to clarify that this improvement has nothing to do with the 
scheduling process. It just marks inactive nodes as untracked after the timeout 
specified by yarn.resourcemanager.node-removal-untracked.timeout-ms and then 
removes them from the nodes list for YARN clusters without a configured include 
path, which means RM can periodically clear inactive nodes and avoid the growing 
memory needed to store them. This is most desired in elastic cloud environments 
with frequent auto-scaling operations. 
Attached v3 patch with some background information. [~zhuqi], [~snemeth], could 
you please help to take a look at this issue? Thanks.

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-28 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Attachment: YARN-10854.003.patch

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch, 
> YARN-10854.003.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Commented] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-26 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387758#comment-17387758
 ] 

Tao Yang commented on YARN-10854:
-

Thanks [~kshukla] and [~templedf] for the feedback and review.

Attached v2 patch with further explanation of the new configuration in 
yarn-default.xml. [~templedf], could you please help to review it again? 
Thanks.

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-26 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Attachment: YARN-10854.002.patch

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch, YARN-10854.002.patch
>
>
> Currently inactive nodes which have been decommissioned/shut down/lost for a 
> while (an expiration time specified via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and exist in neither the include nor the exclude file can be marked 
> as untracked nodes and removed from RM state (YARN-4311). This is very useful 
> when auto-scaling is enabled in an elastic cloud environment, as it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which does not match 
> most of our cloud environments: they run without a configured whitelist of 
> nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
> further security requirements.
> So I propose to support marking inactive nodes as untracked without a 
> configured include path. To stay compatible with former versions, we can add a 
> switch config for this.
> Any thoughts/suggestions/feedback are welcome!






[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-15 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Description: 
Currently inactive nodes which have been decommissioned/shut down/lost for a 
while (an expiration time specified via 
{{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
default) and exist in neither the include nor the exclude file can be marked as 
untracked nodes and removed from RM state (YARN-4311). This is very useful when 
auto-scaling is enabled in an elastic cloud environment, as it avoids an 
unlimited increase of inactive nodes (mostly decommissioned nodes).

But this only works when the include path is configured, which does not match 
most of our cloud environments: they run without a configured whitelist of 
nodes, since that keeps auto-scaling of nodes easy to manage when there are no 
further security requirements.

So I propose to support marking inactive nodes as untracked without a configured 
include path. To stay compatible with former versions, we can add a switch 
config for this.

Any thoughts/suggestions/feedback are welcome!

  was:
Currently, inactive nodes which have been decommissioned/shutdown/lost for a 
while (an expiration time defined via 
{{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
default) and which exist in neither the include nor the exclude file can be 
marked as untracked nodes and removed from RM state. This is very useful when 
auto-scaling is enabled in an elastic cloud environment, since it avoids an 
unlimited increase of inactive nodes (mostly decommissioned nodes).

But this only works when the include path is configured, which is not the case 
for most of our cloud environments: no node whitelist is configured there, 
which allows easy control of the auto-scaling of nodes when there are no 
further security requirements.

So I propose to support marking inactive nodes as untracked even when no 
include path is configured; to stay compatible with former versions, we can add 
a switch config for this.

Any thoughts/suggestions/feedbacks are welcome!


> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch
>
>
> Currently, inactive nodes which have been decommissioned/shutdown/lost for a 
> while (an expiration time defined via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and which exist in neither the include nor the exclude file can be 
> marked as untracked nodes and removed from RM state (YARN-4311). This is very 
> useful when auto-scaling is enabled in an elastic cloud environment, since it 
> avoids an unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which is not the 
> case for most of our cloud environments: no node whitelist is configured 
> there, which allows easy control of the auto-scaling of nodes when there are 
> no further security requirements.
> So I propose to support marking inactive nodes as untracked even when no 
> include path is configured; to stay compatible with former versions, we can 
> add a switch config for this.
> Any thoughts/suggestions/feedbacks are welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-15 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10854:

Attachment: YARN-10854.001.patch

> Support marking inactive node as untracked without configured include path
> --
>
> Key: YARN-10854
> URL: https://issues.apache.org/jira/browse/YARN-10854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10854.001.patch
>
>
> Currently, inactive nodes which have been decommissioned/shutdown/lost for a 
> while (an expiration time defined via 
> {{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
> default) and which exist in neither the include nor the exclude file can be 
> marked as untracked nodes and removed from RM state. This is very useful when 
> auto-scaling is enabled in an elastic cloud environment, since it avoids an 
> unlimited increase of inactive nodes (mostly decommissioned nodes).
> But this only works when the include path is configured, which is not the 
> case for most of our cloud environments: no node whitelist is configured 
> there, which allows easy control of the auto-scaling of nodes when there are 
> no further security requirements.
> So I propose to support marking inactive nodes as untracked even when no 
> include path is configured; to stay compatible with former versions, we can 
> add a switch config for this.
> Any thoughts/suggestions/feedbacks are welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10854) Support marking inactive node as untracked without configured include path

2021-07-15 Thread Tao Yang (Jira)
Tao Yang created YARN-10854:
---

 Summary: Support marking inactive node as untracked without 
configured include path
 Key: YARN-10854
 URL: https://issues.apache.org/jira/browse/YARN-10854
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang


Currently, inactive nodes which have been decommissioned/shutdown/lost for a 
while (an expiration time defined via 
{{yarn.resourcemanager.node-removal-untracked.timeout-ms}}, 60 seconds by 
default) and which exist in neither the include nor the exclude file can be 
marked as untracked nodes and removed from RM state. This is very useful when 
auto-scaling is enabled in an elastic cloud environment, since it avoids an 
unlimited increase of inactive nodes (mostly decommissioned nodes).

But this only works when the include path is configured, which is not the case 
for most of our cloud environments: no node whitelist is configured there, 
which allows easy control of the auto-scaling of nodes when there are no 
further security requirements.

So I propose to support marking inactive nodes as untracked even when no 
include path is configured; to stay compatible with former versions, we can add 
a switch config for this.

Any thoughts/suggestions/feedbacks are welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204467#comment-17204467
 ] 

Tao Yang commented on YARN-8737:


Hi, [~Amithsha], [~wangda], [~bteke]. Sorry for missing this issue for so long.
I haven't dug into this issue or checked whether the exception still happens (I 
just searched the key words "Comparison method violates its general contract" 
in the RM logs of our YARN clusters, which are only kept for 7 days, and 
nothing was returned), since this exception can't crash or affect the 
scheduling process in our internal versions.
After looking into YARN-10178, I think this problem may have multiple causes; 
the common point is that some resources, like the capacity resource or used 
resource of child queues (leaf or parent queues), change while the parent queue 
is sorting them.
I think this patch can solve the problem for the configuration-updating 
scenario: adding a read lock in ParentQueue#sortAndGetChildrenAllocationIterator 
prevents the child queues' configured capacity from being updated while they 
are being sorted. A hedged sketch of that locking pattern follows below.
[~wangda], [~bteke], I would very much appreciate it if you could help review 
and commit this patch.
We should also fix the problem for the scheduling scenario in YARN-10178.
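As a minimal, self-contained sketch of the read/write-lock pattern described above (a simplified stand-in, not the actual ParentQueue code): sorting runs under the read lock, so a concurrent reinitialize, which takes the write lock, cannot change the configured capacities while TimSort is comparing them.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Simplified stand-in for a parent queue; illustrates the locking pattern only. */
class SortingQueueSketch {
  /** Simplified stand-in for a child queue with a mutable configured capacity. */
  static class Child {
    volatile float configuredCapacity;
    Child(float capacity) { this.configuredCapacity = capacity; }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<Child> children = new ArrayList<>();

  /** Analogue of sortAndGetChildrenAllocationIterator: sort under the read lock. */
  Iterator<Child> sortAndGetChildrenIterator() {
    lock.readLock().lock();
    try {
      // While the read lock is held, reinitialize() cannot change capacities,
      // so the comparator stays consistent for the whole TimSort run.
      List<Child> sorted = new ArrayList<>(children);
      sorted.sort(Comparator.comparingDouble(c -> c.configuredCapacity));
      return sorted.iterator();
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Analogue of reinitialize: update capacities only under the write lock. */
  void reinitialize(List<Float> newCapacities) {
    lock.writeLock().lock();
    try {
      children.clear();
      for (float capacity : newCapacities) {
        children.add(new Child(capacity));
      }
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}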

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API, so in the RM 
> the parent queue is refreshing its child queues by calling 
> ParentQueue#reinitialize; meanwhile, async-scheduling threads are sorting the 
> child queues in ParentQueue#sortAndGetChildrenAllocationIterator. A race 
> condition may happen and throw an exception like the following, because 
> TimSort does not tolerate concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151440#comment-17151440
 ] 

Tao Yang commented on YARN-10319:
-

Thanks [~prabhujoseph] for updating the patch. The latest patch LGTM.
[~adam.antal], could you please help to review again? Thanks.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch
>
>
> ActivitiesManager records the call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-01 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149285#comment-17149285
 ] 

Tao Yang commented on YARN-10319:
-

Thanks [~adam.antal] for the review and comments. [~prabhujoseph], could you 
please consider these suggestions as well?
Most changes in the latest patch LGTM. A minor suggestion is to change the root 
element name of BulkActivitiesInfo from "schedulerActivities" to 
"bulkActivities"; related places like 
ActivitiesTestUtils#FN_SCHEDULER_BULK_ACT_ROOT should be changed as well. A 
hedged sketch of that renaming is shown below.
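As an illustration of the rename, a minimal JAXB sketch (not the actual patch; the class body is omitted): the root element name is what surfaces as the JSON/XML key in the REST response, so changing it from "schedulerActivities" to "bulkActivities" is a one-annotation change on the DAO.
{code:java}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

/**
 * Sketch only: the JAXB root element name drives the JSON/XML key the REST
 * client sees, so "schedulerActivities" becomes "bulkActivities".
 */
@XmlRootElement(name = "bulkActivities")
@XmlAccessorType(XmlAccessType.FIELD)
public class BulkActivitiesInfoSketch {
  // The real BulkActivitiesInfo carries a list of ActivitiesInfo; omitted here.
}
{code}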

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records the call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-06-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149030#comment-17149030
 ] 

Tao Yang commented on YARN-10319:
-

Thanks for updating the patch, and sorry for missing the last comment. I will 
take a look at the latest patch later today.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records the call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-06-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143416#comment-17143416
 ] 

Tao Yang commented on YARN-10319:
-

Thanks [~prabhujoseph] for this improvement.
I agree that it may be a helpful addition to the single-node lookup mechanism: 
with it, users can get all-node activities across multiple scheduling cycles 
at once for better debugging.
Some comments about the patch:
* Would it be better to rename "bulkactivities" (the REST API name) to 
"bulk-activities"?
* SchedulerActivitiesInfo is similar to ActivitiesInfo, which also means 
scheduler activities info; can we rename it to BulkActivitiesInfo?
* To keep consistency, we can also rename RMWebServices#getLastNActivities to 
RMWebServices#getBulkActivities.
* ActivitiesManager#recordCount can be affected by both the activities and 
bulk-activities REST APIs; we can use `recordCount.compareAndSet(0, 1)` instead 
of `recordCount.set(1)` to avoid an unexpected number of bulk activities, 
right? (A hedged sketch of the difference follows this list.)
* The fetching approaches of the activities and bulk-activities REST APIs are 
different (asynchronous vs. synchronous); I think we should elaborate on this 
in the documentation.
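To illustrate the compareAndSet point above under one plausible reading: recordCount below is a simplified stand-in for ActivitiesManager#recordCount, not the actual code. set(1) would clobber an in-flight bulk request, while compareAndSet(0, 1) only arms recording when nothing else is being recorded.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal sketch of the compareAndSet suggestion; recordCount here is a
 * simplified stand-in for ActivitiesManager#recordCount.
 */
public class RecordCountSketch {
  private final AtomicInteger recordCount = new AtomicInteger(0);

  /** Bulk-activities API: request recording of the next N scheduling cycles. */
  public void startBulkRecording(int n) {
    recordCount.set(n);
  }

  /**
   * Single-activities API with set(1): would clobber an in-flight bulk request
   * (e.g. N=5 becomes 1), so fewer bulk activities than expected get recorded.
   */
  public void startSingleRecordingUnsafe() {
    recordCount.set(1);
  }

  /**
   * Single-activities API with compareAndSet(0, 1): only arms recording when
   * nothing is currently being recorded, leaving a bulk request untouched.
   */
  public boolean startSingleRecordingSafe() {
    return recordCount.compareAndSet(0, 1);
  }
}
{code}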

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch
>
>
> ActivitiesManager records the call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2020-06-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133859#comment-17133859
 ] 

Tao Yang commented on YARN-8011:


Thanks [~Jim_Brennan] for the feedback and contribution.
The patch for branch-2.10 LGTM and has been committed to branch-2.10. Thanks.

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: YARN-8011-branch-2.10.001.patch, YARN-8011.001.patch, 
> YARN-8011.002.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following error sometimes occurs:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction happening a little after the 
> assertion runs. To solve this problem, the test can wait a short while before 
> this assertion, as below.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk

2020-06-11 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8011:
---
Fix Version/s: 2.10.1

> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  fails sometimes in trunk
> ---
>
> Key: YARN-8011
> URL: https://issues.apache.org/jira/browse/YARN-8011
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.1.0, 2.10.1
>
> Attachments: YARN-8011-branch-2.10.001.patch, YARN-8011.001.patch, 
> YARN-8011.002.patch
>
>
> TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart
>  often passes, but the following error sometimes occurs:
> {noformat}
> java.lang.AssertionError: 
> Expected :15360
> Actual :14336
> 
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}
>  
> This problem is caused by the resource deduction happening a little after the 
> assertion runs. To solve this problem, the test can wait a short while before 
> this assertion, as below.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133848#comment-17133848
 ] 

Tao Yang commented on YARN-10293:
-

I think this patch is good enough, and I would like to commit the latest patch 
in a few hours if there are no objections. Thanks [~prabhujoseph] for this 
contribution.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129091#comment-17129091
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~prabhujoseph] for updating the patch.
LGTM now. [~wangda], do you have any comments or suggestions about the patch?

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> {code}

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-07 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127867#comment-17127867
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~prabhujoseph] for updating the patch.
Another concern about the UT: could you finish it without changing the access 
modifier of SchedulerNode#addUnallocatedResource? I think directly calling 
SchedulerNode#addUnallocatedResource in a UT is hard to understand.
BTW, please fix the remaining check-style warning; the UT failures seem 
unrelated to this patch.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126407#comment-17126407
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~prabhujoseph] for this effort. I'm fine with it, please go ahead.
{quote}
Yes sure, YARN-9598 addresses many other issues. Will check how to contribute 
to the same and address any other optimization required.
{quote}
Good to hear that, thanks.
For the patch, overall it looks good; some suggestions about the UT:
* In TestCapacitySchedulerMultiNodes#testExcessReservationWillBeUnreserved, 
this patch changes the behavior of the second-to-last allocation and makes the 
last allocation unnecessary; can you remove line 261 to line 267 to make it 
clearer?
{code}
Assert.assertEquals(1, schedulerApp1.getLiveContainers().size());
Assert.assertEquals(0, schedulerApp1.getReservedContainers().size());
-Assert.assertEquals(1, schedulerApp2.getLiveContainers().size());
-
-// Trigger scheduling to allocate a container on nm1 for app2.
-cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
-Assert.assertNull(cs.getNode(nm1.getNodeId()).getReservedContainer());
-Assert.assertEquals(1, schedulerApp1.getLiveContainers().size());
-Assert.assertEquals(0, schedulerApp1.getReservedContainers().size());
Assert.assertEquals(2, schedulerApp2.getLiveContainers().size());
Assert.assertEquals(7 * GB,
cs.getNode(nm1.getNodeId()).getAllocatedResource().getMemorySize());
Assert.assertEquals(12 * GB,
cs.getRootQueue().getQueueResourceUsage().getUsed().getMemorySize());
{code}

* Can we remove the 
TestCapacitySchedulerMultiNodesWithPreemption#getFiCaSchedulerApp method and 
get the scheduler app by calling CapacityScheduler#getApplicationAttempt?
* There are lots of while loops, Thread#sleep calls and async-thread 
creation for checking states in 
TestCapacitySchedulerMultiNodesWithPreemption#testAllocationOfReservationFromOtherNode;
 could you please call GenericTestUtils#waitFor, MockRM#waitForState, etc. to 
simplify it? (A hedged sketch of the waitFor pattern follows this list.)
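A minimal sketch of the GenericTestUtils#waitFor suggestion, assuming the waitFor(check, checkEveryMillis, waitForMillis) helper from org.apache.hadoop.test; the polled state below is a simplified stand-in for whatever scheduler state the real test inspects, not the actual test code.
{code:java}
import java.util.concurrent.atomic.AtomicReference;
import org.apache.hadoop.test.GenericTestUtils;

/**
 * Sketch only: replaces hand-written while/sleep loops with a bounded poll.
 * The AtomicReference stands in for the scheduler state the real test checks
 * (e.g. a container state reached after an async scheduling cycle).
 */
public class WaitForSketch {
  public static void main(String[] args) throws Exception {
    AtomicReference<String> containerState = new AtomicReference<>("RESERVED");

    // Simulate the asynchronous transition the test is waiting for.
    new Thread(() -> {
      try {
        Thread.sleep(300);
      } catch (InterruptedException ignored) {
      }
      containerState.set("ALLOCATED");
    }).start();

    // Poll every 100 ms, fail after 10 s, instead of an unbounded while/sleep loop.
    GenericTestUtils.waitFor(() -> "ALLOCATED".equals(containerState.get()),
        100, 10_000);
    System.out.println("state: " + containerState.get());
  }
}
{code}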

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124527#comment-17124527
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~wangda] for your confirmation.
I think the proposed change can solve the problem for heartbeat-driven 
scheduling but not for async scheduling, since the latter may still get stuck 
in a loop that chooses the first of the candidate nodes and then re-reserves, 
as mentioned in YARN-9598.
However, if what we want for this issue is just to fix the problem for 
heartbeat-driven scenarios, with a more complete solution to follow later, the 
change is fine with me for now. In our internal version, we have already 
removed this check to support allocating OPPORTUNISTIC containers in the main 
scheduling process.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123686#comment-17123686
 ] 

Tao Yang commented on YARN-10293:
-

Hi, [~prabhujoseph], [~wangda],
This problem is similar to YARN-9598, which was in dispute, so there has been 
no further progress there. In my opinion, YARN-9598 and this issue may be just 
parts of the broader reservation problems; it would be better to refactor the 
reservation logic to be compatible with the scheduling framework, which has 
been changed a lot by the global scheduler, especially the multi-node lookup 
mechanism. At least we should rethink all the related logic in the scheduling 
cycle to reach a more complete solution for the current reservation behavior. 
Thoughts?

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> 

[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities

2020-03-13 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059185#comment-17059185
 ] 

Tao Yang commented on YARN-9050:


Thanks [~cheersyang] very much for your help and patience, much appreciated!

> [Umbrella] Usability improvements for scheduler activities
> --
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities based on 
> YARN 3.1 in our cluster, as follows:
>  1. Not usable with multi-threaded asynchronous scheduling. App and node 
> activities may be confused when multiple scheduling threads record activities 
> of different allocation processes in the same variables, like appsAllocation 
> and recordingNodesAllocation in ActivitiesManager. I think these variables 
> should be thread-local to keep activities clear among multiple threads.
>  2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger will skip recording through {{if (node == null || 
> activitiesManager == null) }} when node is null, which indicates the 
> allocation is for multi-nodes. We need to support recording activities for 
> the multi-node lookup mechanism.
>  3. Current app activities cannot meet the requirements of diagnostics; for 
> example, we can know that a node doesn't match a request but it is hard to 
> know why, especially when using placement constraints, where it's difficult 
> to make a detailed diagnosis manually. So I propose to improve the diagnoses 
> of activities: add a diagnosis for the placement-constraints check, update 
> the insufficient-resource diagnosis with detailed info (like 'insufficient 
> resource names:[memory-mb]') and so on.
>  4. Add more useful fields for app activities. In some scenarios we need to 
> distinguish different requests but can't locate them based on the app 
> activities info; some other fields, such as allocation tags, can help to 
> filter what we want. We have added containerPriority, allocationRequestId 
> and allocationTags fields to AppAllocation.
>  5. Filter app activities by key fields. Sometimes the results of app 
> activities are massive and it's hard to find what we want. We have supported 
> filtering by allocation-tags to meet requirements from some apps; moreover, 
> we can take container-priority and allocation-request-id as candidates if 
> necessary.
>  6. Aggregate app activities by diagnoses. For a single allocation process, 
> activities can still be massive in a large cluster. We frequently want to 
> know why a request can't be allocated in the cluster, and it's hard to check 
> every node manually in a large cluster, so aggregating app activities by 
> diagnoses is necessary. We have added a groupingType parameter to the 
> app-activities REST API for this, which supports grouping by diagnostics.
> I think we can have a discussion about these points; useful improvements 
> which are accepted will be added into the patch. Thanks.
> Running design doc is attached 
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10192) CapacityScheduler stuck in loop rejecting allocation proposals

2020-03-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057537#comment-17057537
 ] 

Tao Yang commented on YARN-10192:
-

Hi, [~wangda]. 
I'm not sure about this issue. We have found some issues when async-scheduling 
is enabled, but this one seems not to be in async-scheduling mode according to 
the logs above, and it's hard to find the root cause from these logs alone. I 
think more logs are needed for further analysis, e.g. by dynamically updating 
the log level of some important classes (such as 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
 to DEBUG. BTW, scheduler activities are more useful for debugging, but they are 
only available in version 3.3 and later.
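
For reference, the log level of such classes can usually be raised at runtime 
without restarting the RM, for example (host and port are placeholders; please 
verify the command against your Hadoop version):
{noformat}
hadoop daemonlog -setlevel <rm-host>:<rm-http-port> \
  org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp DEBUG
{noformat}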

> CapacityScheduler stuck in loop rejecting allocation proposals
> --
>
> Key: YARN-10192
> URL: https://issues.apache.org/jira/browse/YARN-10192
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Priority: Major
>
> On a 2.10.0 cluster, we observed containers were being scheduled very slowly. 
> Based on the logs, the scheduler seems to reject a bunch of allocation 
> proposals, then accept a bunch of reserved containers, but very few containers 
> actually get allocated:
> {noformat}
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,981 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default 

[jira] [Commented] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-02-18 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039665#comment-17039665
 ] 

Tao Yang commented on YARN-10151:
-

Hi, [~leftnoteasy]. FYI, a related issue that can cause this has been 
fixed in YARN-9838.

> Disable Capacity Scheduler's move app between queue functionality
> -
>
> Key: YARN-10151
> URL: https://issues.apache.org/jira/browse/YARN-10151
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Saw this happen in many clusters: Capacity Scheduler cannot work correctly 
> with the move-app-between-queues feature. It causes weird JMX issues, 
> resource accounting issues, etc. In a lot of cases it leaves the RM 
> completely hung and the available resources negative; nothing can be 
> allocated after that. We should turn off CapacityScheduler's move app between 
> queue feature. (see: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
>  )
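
This is not the actual fix tracked here, only an illustrative sketch of the kind 
of guard being discussed; the flag, message and method shape are assumptions:
{code:java}
// Sketch: refuse cross-queue moves when the (hypothetical) switch is off,
// instead of risking inconsistent queue accounting and a hung RM.
public class MoveAppGuardExample {
  // Hypothetical switch; not an actual YARN configuration property.
  static final boolean MOVE_BETWEEN_QUEUES_ENABLED = false;

  public static void moveApplication(String appId, String targetQueue) {
    if (!MOVE_BETWEEN_QUEUES_ENABLED) {
      throw new UnsupportedOperationException(
          "Move application is disabled for CapacityScheduler");
    }
    // ... perform the move under the scheduler write lock ...
  }

  public static void main(String[] args) {
    try {
      moveApplication("application_1582403122262_15460", "misc_default");
    } catch (UnsupportedOperationException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
{code}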



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-02-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029648#comment-17029648
 ] 

Tao Yang commented on YARN-9567:


Thanks [~cheersyang] for the review. 
It seems that the wrong file was taken as the new patch; from the console 
output: YARN-9567 patch is being downloaded at Mon Feb  3 20:38:28 UTC 
2020 from  
https://issues.apache.org/jira/secure/attachment/12991343/scheduler-activities-example.png
 -> Downloaded
Attached v4 patch (same as the v3 patch) to re-trigger the Jenkins job.

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, YARN-9567.004.patch, app-activities-example.png, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, scheduler-activities-example.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  
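
A self-contained illustration of the passive approach described above; the class, 
method and cache-key layout are assumptions made for the example, not the actual 
implementation:
{code:java}
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// The app attempt page only reads diagnostics that completed app activities
// have already cached; it never triggers a new recording by itself.
public class PassiveDiagnosticsLookup {
  // "appId!priority" -> last known diagnostic, filled by the activities recorder.
  private final Map<String, String> completedAppActivitiesCache =
      new ConcurrentHashMap<>();

  /** Called when rendering outstanding requests; read-only, no recording. */
  public String diagnosticsFor(String appId, int requestPriority) {
    return Optional
        .ofNullable(completedAppActivitiesCache.get(appId + "!" + requestPriority))
        .orElse("N/A");  // shown until the app-activities REST API is requested
  }

  public static void main(String[] args) {
    PassiveDiagnosticsLookup lookup = new PassiveDiagnosticsLookup();
    String appId = "application_1582403122262_15460";
    System.out.println(lookup.diagnosticsFor(appId, 1));   // N/A at first
    lookup.completedAppActivitiesCache.put(appId + "!1",
        "Node does not have sufficient resource for the request");
    System.out.println(lookup.diagnosticsFor(appId, 1));   // diagnostic appears
  }
}
{code}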



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-02-04 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: YARN-9567.004.patch

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, YARN-9567.004.patch, app-activities-example.png, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, scheduler-activities-example.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-19 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019278#comment-17019278
 ] 

Tao Yang commented on YARN-9538:


Thanks [~cheersyang] for the review. 
Attached v4 patch to fix failures in Jenkins.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch, 
> YARN-9538.003.patch, YARN-9538.004.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9538:
---
Attachment: YARN-9538.004.patch

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch, 
> YARN-9538.003.patch, YARN-9538.004.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019208#comment-17019208
 ] 

Tao Yang commented on YARN-9567:


Thanks [~cheersyang] for the review. I have attached the v3 patch with updates:
 * Show activities info only when CS is enabled.
 * Support pagination for the activities table. Examples:
Showing app diagnostics:
!app-activities-example.png! 
Showing scheduler activities (when app diagnostics are not found):
!scheduler-activities-example.png! 

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, app-activities-example.png, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, scheduler-activities-example.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: scheduler-activities-example.png

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, app-activities-example.png, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, scheduler-activities-example.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: app-activities-example.png

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, app-activities-example.png, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: YARN-9567.003.patch

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: (was: YARN-9567.003.patch)

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-19 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9567:
---
Attachment: YARN-9567.003.patch

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> YARN-9567.003.patch, image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()

2020-01-10 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012615#comment-17012615
 ] 

Tao Yang commented on YARN-7007:


Already cherry-picked this fix to branch-2.8

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Lingfeng Su
>Assignee: Lingfeng Su
>Priority: Major
>  Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.6
>
> Attachments: YARN-7007.001.patch
>
>
> {code:java}
> java.lang.NullPointerException: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254)
>   at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy161.getApplications(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456)
> {code}
> When I use YarnClient.getApplications() to get all applications from the RM, 
> it occasionally throws an NPE.
> {code:java}
> RMAppAttempt currentAttempt = rmContext.getRMApps()
>    .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
> If the application id is no longer present in the map returned by 
> getRMApps(), this chained call may throw an NPE.
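
A defensive variant of the snippet above, shown only as a sketch of the null 
check (not necessarily the committed fix):
{code:java}
RMApp app = rmContext.getRMApps().get(attemptId.getApplicationId());
RMAppAttempt currentAttempt = (app == null) ? null : app.getCurrentAppAttempt();
if (currentAttempt == null) {
  // The app has already been removed from the RM context; skip it or
  // fall back to default values instead of dereferencing null.
}
{code}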



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7007) NPE in RM while using YarnClient.getApplications()

2020-01-10 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-7007:
---
Fix Version/s: 2.8.6

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Lingfeng Su
>Assignee: Lingfeng Su
>Priority: Major
>  Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.6
>
> Attachments: YARN-7007.001.patch
>
>
> {code:java}
> java.lang.NullPointerException: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254)
>   at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy161.getApplications(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456)
> {code}
> When I use YarnClient.getApplications() to get all applications from the RM, 
> it occasionally throws an NPE.
> {code:java}
> RMAppAttempt currentAttempt = rmContext.getRMApps()
>    .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
> If the application id is no longer present in the map returned by 
> getRMApps(), this chained call may throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()

2020-01-10 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012554#comment-17012554
 ] 

Tao Yang commented on YARN-7007:


[~fly_in_gis], thanks for the feedback, I will cherry-pick this fix to 2.8 
later.

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Lingfeng Su
>Assignee: Lingfeng Su
>Priority: Major
>  Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-7007.001.patch
>
>
> {code:java}
> java.lang.NullPointerException: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254)
>   at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy161.getApplications(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456)
> {code}
> When I use YarnClient.getApplications() to get all applications from the RM, 
> it occasionally throws an NPE.
> {code:java}
> RMAppAttempt currentAttempt = rmContext.getRMApps()
>    .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
> If the application id is no longer present in the map returned by 
> getRMApps(), this chained call may throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2020-01-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011789#comment-17011789
 ] 

Tao Yang commented on YARN-9567:


Thanks [~cheersyang] for the review.
{quote}
1. since this is a CS only feature, pls make sure nothing breaks when FS is 
enabled
{quote}
Yes, it should show this table only when CS is enabled; this will be updated in 
the next patch.

{quote}
2. does the table support paging?  
{quote}
Not yet. I think it's not a strong requirement since this is only used for 
debugging; we rarely get a long table here, and even if we do, it should have 
only a minor impact on the UI, right?

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9567.001.patch, YARN-9567.002.patch, 
> image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, 
> image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, 
> no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; it 
> would be helpful for users to know why they are pending if we could join this 
> app's diagnostics with them. 
> As discussed with [~cheersyang], we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we see no diagnostics below the outstanding requests if app 
> activities have not been triggered.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-09 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9538:
---
Attachment: YARN-9538.003.patch

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch, 
> YARN-9538.003.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011781#comment-17011781
 ] 

Tao Yang commented on YARN-9538:


Attached v3 patch in which most comments are addressed; updates that need more 
discussion are as follows:

CS
 1. the table of contents can be auto-generated by Doxia macros via defining 
"MACRO\{toc|fromDepth=0|toDepth=3}", so there's nothing more we need to do for this.

I have updated the other items; please help to review them as well, thanks:

//  Activities

Scheduling activities are activity messages used for debugging on some critical 
scheduling paths; they can be recorded and exposed via a RESTful API with minor 
impact on scheduler performance.

// Scheduler Activities

Scheduler activities include useful scheduling info from a scheduling cycle, 
illustrating how the scheduler allocates a container.

Scheduler activities REST API 
(`http://rm-http-address:port/ws/v1/cluster/scheduler/activities`) provides a 
way to enable recording scheduler activities and fetch them from the cache. To 
eliminate the performance impact, the scheduler automatically disables recording 
activities at the end of a scheduling cycle; you can query the RESTful API 
again to get the latest scheduler activities. 
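For example, an illustrative request (host and port are placeholders):
{noformat}
curl "http://rm-http-address:port/ws/v1/cluster/scheduler/activities"
{noformat}
The first call enables recording for the next scheduling cycle; calling it again 
later returns the activities cached from that cycle.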

// Application Activities

Application activities include useful scheduling info for a specified 
application, illustrating how its requirements are satisfied or just 
skipped. Application activities REST API 
(`http://rm-http-address:port/ws/v1/cluster/scheduler/app-activities/\{appid}`) 
provides a way to enable recording application activities for a specified 
application within a few seconds, or to fetch historical application activities 
from the cache; the available actions, which include "refresh" and "get", can be 
specified by the "actions" parameter:
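For instance (illustrative requests only; host, port and application id are 
placeholders):
{noformat}
# Start recording activities for this application for a short period:
curl "http://rm-http-address:port/ws/v1/cluster/scheduler/app-activities/application_1582403122262_15460?actions=refresh"

# Fetch what has been cached, without triggering a new recording:
curl "http://rm-http-address:port/ws/v1/cluster/scheduler/app-activities/application_1582403122262_15460?actions=get"
{noformat}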

 

RM
 1. +The scheduler activities API currently supports Capacity Scheduler and 
provides a way to get scheduler activities in a single scheduling process, it 
will trigger recording scheduler activities in next scheduling process and then 
take last required scheduler activities from cache as the response. The 
response have hierarchical structure with multiple levels and important 
scheduling details which are organized by the sequence of scheduling process:

->

The scheduler activities Restful API {color:#FF}is available if you are 
using capacity scheduler and{color} can fetch scheduler activities info 
recorded in a scheduling cycle. The API returns a message that includes 
important scheduling activities info {color:#FF}which has a hierarchical 
layout with following fields:{color}

 

7. + Application activities include useful scheduling info for a specified 
application, the response have hierarchical structure with multiple levels:

->

Application activities Restful API {color:#FF}is available if you are using 
capacity scheduler and can fetch useful scheduling info for a specified 
application{color}, the response has a hierarchical layout with following 
fields:

 

8. * *AppActivities* - AppActivities are root structure of application 
activities within basic information.

->

is the root element?

Yes, updated: AppActivities are root {color:#FF}element{color} ... 

9. +* *Applications* - Allocations are allocation attempts at app level queried 
from the cache.
 ->

shouldn't here be applications?

Right, updated: +* {color:#FF}*Allocations*{color} - Allocations ...

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch, 
> YARN-9538.003.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-08 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011505#comment-17011505
 ] 

Tao Yang commented on YARN-9538:


Thanks [~cheersyang] for finding the mistakes and providing better 
descriptions; I'll fix them as soon as possible.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-08 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011387#comment-17011387
 ] 

Tao Yang commented on YARN-9538:


Attached v2 patch, which has been checked via hugo in my local test environment.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs

2020-01-08 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9538:
---
Attachment: YARN-9538.002.patch

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch, YARN-9538.002.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities

2020-01-07 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010339#comment-17010339
 ] 

Tao Yang commented on YARN-9050:


Glad to hear that the 3.3.0 release is on the way, and thanks for reminding me.
The remaining issues are almost ready and only need some reviews; they can be 
done before this release. Thanks.

> [Umbrella] Usability improvements for scheduler activities
> --
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities based on 
> YARN 3.1 in our cluster, as follows:
>  1. Not available for multi-thread asynchronous scheduling. App and node 
> activities may be confused when multiple scheduling threads record activities 
> of different allocation processes in the same variables, like appsAllocation 
> and recordingNodesAllocation in ActivitiesManager. I think these variables 
> should be thread-local to keep activities separate among multiple threads.
>  2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger skips recording through \{{if (node == null || 
> activitiesManager == null) }} when node is null, which indicates the 
> allocation is for multiple nodes. We need to support recording activities for 
> the multi-node lookup mechanism.
>  3. Current app activities cannot meet the requirements of diagnostics. For 
> example, we can know that a node doesn't match a request but it is hard to 
> know why, especially when using placement constraints; it's difficult to make 
> a detailed diagnosis manually. So I propose to improve the diagnoses of 
> activities: add a diagnosis for the placement-constraints check, update the 
> insufficient-resource diagnosis with detailed info (like 'insufficient 
> resource names:[memory-mb]') and so on.
>  4. Add more useful fields for app activities. In some scenarios we need to 
> distinguish different requests but can't locate them based on the app 
> activities info; there are other fields that can help filter what we want, 
> such as allocation tags. We have added containerPriority, allocationRequestId 
> and allocationTags fields in AppAllocation.
>  5. Filter app activities by key fields. Sometimes the results of app 
> activities are massive and it's hard to find what we want. We have supported 
> filtering by allocation-tags to meet requirements from some apps; moreover, we 
> can take container-priority and allocation-request-id as candidates if 
> necessary.
>  6. Aggregate app activities by diagnoses. For a single allocation process, 
> activities can still be massive in a large cluster. We frequently want to 
> know why a request can't be allocated in the cluster, and it's hard to check 
> every node manually in a large cluster, so aggregating app activities by 
> diagnoses is necessary. We have added a groupingType parameter to the 
> app-activities REST API for this, which supports grouping by diagnostics (an 
> example request follows below).
> I think we can have a discussion about these points; useful improvements that 
> are accepted will be added into the patch. Thanks.
> Running design doc is attached 
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].
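
A possible request for the grouping described in point 6 (host, port, application 
id and the parameter value are placeholders/assumptions):
{noformat}
curl "http://rm-http-address:port/ws/v1/cluster/scheduler/app-activities/application_1582403122262_15460?groupingType=diagnostic"
{noformat}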



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-24 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: YARN-10059.001.patch

> Final states of failed-to-localize containers are not recorded in NM state 
> store
> 
>
> Key: YARN-10059
> URL: https://issues.apache.org/jira/browse/YARN-10059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10059.001.patch
>
>
> We found an issue where many localizers of completed containers were 
> launched and exhausted the memory/cpu of the machine after the NM restarted. 
> These containers had all failed and completed while localizing on a 
> non-existent local directory (which is caused by another problem), but their 
> final states weren't recorded in the NM state store.
>  The process flow of a failed-to-localize container is as follows:
> {noformat}
> ResourceLocalizationService$LocalizerRunner#run
> -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> 
> LOCALIZATION_FAILED upon RESOURCE_FAILED
>   dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
>   -> ResourceLocalizationService#handleCleanupContainerResources  handle 
> CLEANUP_CONTAINER_RESOURCES
>   dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
>   -> ContainerImpl$LocalizationFailedToDoneTransition#transition  
> handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
> {noformat}
> There's currently no state store update in this flow, but one is required to 
> avoid unnecessary localizations after NM restarts.
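
A self-contained illustration of the missing step; the store below is a stand-in 
assumption, not the actual NM state-store API:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// On the LOCALIZATION_FAILED -> DONE path, the container's final state should
// be persisted so that a restarted NM knows the container is already finished
// and does not launch new localizers for it.
public class LocalizationRecoveryExample {
  enum FinalState { COMPLETED_WITH_FAILURE }

  /** Stand-in for the NM recovery store (a leveldb-backed store in practice). */
  static class RecoveryStore {
    private final Map<String, FinalState> finalStates = new ConcurrentHashMap<>();
    void storeFinalState(String containerId, FinalState state) {
      finalStates.put(containerId, state);
    }
    boolean isFinished(String containerId) {
      return finalStates.containsKey(containerId);
    }
  }

  public static void main(String[] args) {
    RecoveryStore store = new RecoveryStore();
    String containerId = "container_1582403122262_15460_01_000001";

    // Localization fails and the container moves to DONE: record the final state.
    store.storeFinalState(containerId, FinalState.COMPLETED_WITH_FAILURE);

    // After an NM restart, recovery consults the store before re-localizing.
    if (store.isFinished(containerId)) {
      System.out.println("Skip localizer launch for finished " + containerId);
    }
  }
}
{code}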



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-24 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: (was: YARN-10059.001.patch)

> Final states of failed-to-localize containers are not recorded in NM state 
> store
> 
>
> Key: YARN-10059
> URL: https://issues.apache.org/jira/browse/YARN-10059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> We found an issue where many localizers of completed containers were 
> launched and exhausted the memory/cpu of the machine after the NM restarted. 
> These containers had all failed and completed while localizing on a 
> non-existent local directory (which is caused by another problem), but their 
> final states weren't recorded in the NM state store.
>  The process flow of a failed-to-localize container is as follows:
> {noformat}
> ResourceLocalizationService$LocalizerRunner#run
> -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> 
> LOCALIZATION_FAILED upon RESOURCE_FAILED
>   dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
>   -> ResourceLocalizationService#handleCleanupContainerResources  handle 
> CLEANUP_CONTAINER_RESOURCES
>   dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
>   -> ContainerImpl$LocalizationFailedToDoneTransition#transition  
> handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
> {noformat}
> There's currently no state store update in this flow, but one is required to 
> avoid unnecessary localizations after NM restarts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002700#comment-17002700
 ] 

Tao Yang commented on YARN-10059:
-

Attached v1 patch for review.

> Final states of failed-to-localize containers are not recorded in NM state 
> store
> 
>
> Key: YARN-10059
> URL: https://issues.apache.org/jira/browse/YARN-10059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10059.001.patch
>
>
> We found an issue where many localizers of completed containers were 
> launched and exhausted the memory/cpu of the machine after the NM restarted. 
> These containers had all failed and completed while localizing on a 
> non-existent local directory (which is caused by another problem), but their 
> final states weren't recorded in the NM state store.
>  The process flow of a failed-to-localize container is as follows:
> {noformat}
> ResourceLocalizationService$LocalizerRunner#run
> -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> 
> LOCALIZATION_FAILED upon RESOURCE_FAILED
>   dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
>   -> ResourceLocalizationService#handleCleanupContainerResources  handle 
> CLEANUP_CONTAINER_RESOURCES
>   dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
>   -> ContainerImpl$LocalizationFailedToDoneTransition#transition  
> handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
> {noformat}
> There's currently no state store update in this flow, but one is required to 
> avoid unnecessary localizations after NM restarts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-23 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: YARN-10059.001.patch

> Final states of failed-to-localize containers are not recorded in NM state 
> store
> 
>
> Key: YARN-10059
> URL: https://issues.apache.org/jira/browse/YARN-10059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-10059.001.patch
>
>
> We found an issue where many localizers for completed containers were 
> launched after an NM restart and exhausted the memory/CPU of that machine. 
> These containers had all failed and completed while localizing on a 
> non-existent local directory (caused by another problem), but their final 
> states weren't recorded in the NM state store.
>  The process flow of a failed-to-localize container is as follows:
> {noformat}
> ResourceLocalizationService$LocalizerRunner#run
> -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> 
> LOCALIZATION_FAILED upon RESOURCE_FAILED
>   dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
>   -> ResourceLocalizationService#handleCleanupContainerResources  handle 
> CLEANUP_CONTAINER_RESOURCES
>   dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
>   -> ContainerImpl$LocalizationFailedToDoneTransition#transition  
> handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
> {noformat}
> There's currently no state-store update in this flow, which is required to 
> avoid unnecessary localizations after NM restarts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-23 Thread Tao Yang (Jira)
Tao Yang created YARN-10059:
---

 Summary: Final states of failed-to-localize containers are not 
recorded in NM state store
 Key: YARN-10059
 URL: https://issues.apache.org/jira/browse/YARN-10059
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Tao Yang
Assignee: Tao Yang


We found an issue where many localizers for completed containers were launched 
after an NM restart and exhausted the memory/CPU of that machine. These 
containers had all failed and completed while localizing on a non-existent 
local directory (caused by another problem), but their final states weren't 
recorded in the NM state store.
 The process flow of a failed-to-localize container is as follows:
{noformat}
ResourceLocalizationService$LocalizerRunner#run
-> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> 
LOCALIZATION_FAILED upon RESOURCE_FAILED
  dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
  -> ResourceLocalizationService#handleCleanupContainerResources  handle 
CLEANUP_CONTAINER_RESOURCES
  dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
  -> ContainerImpl$LocalizationFailedToDoneTransition#transition  
handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
{noformat}
There's currently no state-store update in this flow, which is required to 
avoid unnecessary localizations after NM restarts.
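
To make the gap concrete, below is a small, self-contained sketch in plain Java 
(not the NM code and not the attached patch) of why persisting the terminal 
state matters: on recovery, only containers whose final state was recorded are 
skipped, otherwise their localizers are launched again. The class, enum and 
map-backed "state store" are illustrative stand-ins.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class LocalizationRecoverySketch {
  enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }

  // Stand-in for the NM state store (in the NM this is a persistent store).
  static final Map<String, State> stateStore = new HashMap<>();

  static void onLocalizationFailed(String containerId) {
    // The missing step in the reported flow: record the terminal state
    // before the container is transitioned to DONE.
    stateStore.put(containerId, State.DONE);
  }

  static void recoverAfterRestart(String containerId) {
    if (stateStore.get(containerId) == State.DONE) {
      System.out.println(containerId + ": already completed, skip localizer");
    } else {
      System.out.println(containerId + ": re-launching localizer");
    }
  }

  public static void main(String[] args) {
    onLocalizationFailed("container_01");
    recoverAfterRestart("container_01");  // skipped after restart
    recoverAfterRestart("container_02");  // no final state recorded, re-localized
  }
}
{code}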



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9838) Fix resource inconsistency for queues when moving app with reserved container to another queue

2019-11-22 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Fix Version/s: 3.1.4
   3.2.2
   2.9.3
   3.3.0

> Fix resource inconsistency for queues when moving app with reserved container 
> to another queue
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Assignee: jiulongzhu
>Priority: Critical
>  Labels: patch
> Fix For: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    
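
For illustration only, a self-contained sketch (not CapacityScheduler code and 
not the attached patches) of the accounting asymmetry described above, and of 
the kind of transfer a fix would need to perform during movetoqueue. Queue 
names and the counter map are hypothetical.
{code:java}
import java.util.Map;
import java.util.TreeMap;

public class MoveReservedContainerSketch {
  static final Map<String, Integer> numContainer = new TreeMap<>();

  static void allocate(String queue) { numContainer.merge(queue, 1, Integer::sum); }
  static void release(String queue)  { numContainer.merge(queue, -1, Integer::sum); }

  // Buggy move: QueueMetrics are moved, but numContainer/used are not touched.
  static void moveBuggy(String from, String to) { /* nothing updated */ }

  // Fixed move: transfer the reserved container's accounting as well.
  static void moveFixed(String from, String to) { release(from); allocate(to); }

  public static void main(String[] args) {
    allocate("FROM");
    moveBuggy("FROM", "TO");
    release("TO");
    System.out.println(numContainer);   // {FROM=1, TO=-1} -> the "Container Leak"

    numContainer.clear();
    allocate("FROM");
    moveFixed("FROM", "TO");
    release("TO");
    System.out.println(numContainer);   // {FROM=0, TO=0} -> conservative
  }
}
{code}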



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9838) Fix resource inconsistency for queues when moving app with reserved container to another queue

2019-11-21 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Summary: Fix resource inconsistency for queues when moving app with 
reserved container to another queue  (was: Using the CapacityScheduler,Apply 
"movetoqueue" on the application which CS reserved containers for,will cause 
"Num Container" and "Used Resource" in ResourceUsage metrics error )

> Fix resource inconsistency for queues when moving app with reserved container 
> to another queue
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Assignee: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9635) Nodes page displayed duplicate nodes

2019-11-14 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974098#comment-16974098
 ] 

Tao Yang commented on YARN-9635:


Hi, [~jiwq]. I think the description of the conf in NodeManager.md is not 
enough yet; we should add some details about this change, such as which 
version it starts from and why.
[~sunilg], any thoughts about the new patch?

> Nodes page displayed duplicate nodes
> 
>
> Key: YARN-9635
> URL: https://issues.apache.org/jira/browse/YARN-9635
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager
>Affects Versions: 3.2.0
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: UI2-nodes.jpg, YARN-9635.001.patch, YARN-9635.002.patch
>
>
> Steps:
>  * shutdown nodes
>  * start nodes
> Nodes Page:
> !UI2-nodes.jpg!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9958) Remove the invalid lock in ContainerExecutor

2019-11-14 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974079#comment-16974079
 ] 

Tao Yang commented on YARN-9958:


Thanks [~jiwq] for this improvement. Patch LGTM: the related r/w lock only 
guards ContainerExecutor#pidFiles, which is a ConcurrentHashMap and does not 
need to be protected by an additional lock.
 I will commit this in a few days if there are no further comments.
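
For readers less familiar with java.util.concurrent, a minimal sketch of the 
point above: single get/put/remove operations on a ConcurrentHashMap are 
already thread-safe and atomic, so wrapping them in an extra read/write lock 
adds contention without adding safety. The class below only borrows the 
pidFiles field name for illustration; it is not the ContainerExecutor code.
{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PidFilesSketch {
  private final ConcurrentMap<String, Path> pidFiles = new ConcurrentHashMap<>();

  public void recordPidFile(String containerId, Path pidFile) {
    pidFiles.put(containerId, pidFile);   // no external lock needed
  }

  public Path getPidFilePath(String containerId) {
    return pidFiles.get(containerId);     // no external lock needed
  }

  public void removePidFile(String containerId) {
    pidFiles.remove(containerId);         // no external lock needed
  }

  public static void main(String[] args) {
    PidFilesSketch sketch = new PidFilesSketch();
    sketch.recordPidFile("container_01", Paths.get("/tmp/container_01.pid"));
    System.out.println(sketch.getPidFilePath("container_01"));
  }
}
{code}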

> Remove the invalid lock in ContainerExecutor
> 
>
> Key: YARN-9958
> URL: https://issues.apache.org/jira/browse/YARN-9958
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
>
> ContainerExecutor has a ReadLock and a WriteLock. These are used around the 
> get/put calls on a ConcurrentMap. Since ConcurrentMap already provides 
> thread-safety and atomicity guarantees, we can remove the locks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-10-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845
 ] 

Tao Yang edited comment on YARN-7621 at 10/23/19 12:51 PM:
---

Hi, [~cane]. Sorry for the late reply.

It makes perfect sense to me to support duplicate queue names; as [~wilfreds] 
mentioned, there's more work to do for that. I'm afraid I have no time to work 
on this recently, so please feel free to take over this issue if you want. 
Thanks.


was (Author: tao yang):
Hi, [~cane]. Sorry for the late reply.

It's make perfect sense for me to support duplicate queue names, as [~wilfreds] 
mentioned, there's more work to do for that.  I'm afraid of having no time to 
work on this recently, please feel free to take over this issue if you want, 
Thanks.

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue 
> name. There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but 
> it makes switching between FairScheduler and CapacityScheduler hard. I 
> propose to support submitting apps with a queue path for CapacityScheduler 
> to make the interface clearer and the scheduler switch smoother.
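
A minimal sketch of the idea, not the attached patches: because 
CapacityScheduler enforces unique leaf-queue names, a submitted queue path can 
be reduced to its last component before the existing name-based lookup, so a 
single ApplicationSubmissionContext queue value can work for both schedulers. 
The helper and the queue names below are hypothetical.
{code:java}
public class QueuePathSketch {
  // e.g. "root.engineering.spark" -> "spark"; a bare leaf name passes through.
  static String resolveLeafQueueName(String queuePathOrName) {
    int lastDot = queuePathOrName.lastIndexOf('.');
    return lastDot < 0 ? queuePathOrName : queuePathOrName.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    System.out.println(resolveLeafQueueName("root.engineering.spark")); // spark
    System.out.println(resolveLeafQueueName("spark"));                  // spark
  }
}
{code}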



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-10-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845
 ] 

Tao Yang commented on YARN-7621:


Hi, [~cane]. Sorry for the late reply.

It makes perfect sense to me to support duplicate queue names; as [~wilfreds] 
mentioned, there's more work to do for that. I'm afraid I have no time to work 
on this recently, so please feel free to take over this issue. Thanks.

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue 
> name. There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but 
> it makes switching between FairScheduler and CapacityScheduler hard. I 
> propose to support submitting apps with a queue path for CapacityScheduler 
> to make the interface clearer and the scheduler switch smoother.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-10-23 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845
 ] 

Tao Yang edited comment on YARN-7621 at 10/23/19 12:48 PM:
---

Hi, [~cane]. Sorry for the late reply.

It makes perfect sense to me to support duplicate queue names; as [~wilfreds] 
mentioned, there's more work to do for that. I'm afraid I have no time to work 
on this recently, so please feel free to take over this issue if you want. 
Thanks.


was (Author: tao yang):
Hi, [~cane]. Sorry for the late reply.

It's make perfect sense for me to support duplicate queue names, as [~wilfreds] 
mentioned, there's more work to do for that.  I'm afraid of having no time to 
work on this recently, please feel free to take over this issue, Thanks.

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue 
> name. There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but 
> it makes switching between FairScheduler and CapacityScheduler hard. I 
> propose to support submitting apps with a queue path for CapacityScheduler 
> to make the interface clearer and the scheduler switch smoother.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2019-10-15 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952049#comment-16952049
 ] 

Tao Yang commented on YARN-8737:


Thanks [~cheersyang] for the review. Submitted already.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator raised an update for queues through the REST API; in the RM 
> the parent queue was refreshing its child queues by calling 
> ParentQueue#reinitialize while, at the same time, the async-schedule threads 
> were sorting the child queues in 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may happen 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read-lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write-lock will be held when updating child queues in 
> ParentQueue#reinitialize.
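
A minimal sketch of the proposed locking, not the actual ParentQueue code: 
sorting works on a copy taken under the read lock, while reinitialize replaces 
the child list under the write lock, so TimSort never observes a concurrent 
modification. Class and field names are illustrative.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ParentQueueLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> childQueues = new ArrayList<>(List.of("b", "a", "c"));

  // Analogue of sortAndGetChildrenAllocationIterator: sort a snapshot under
  // the read lock.
  public List<String> sortedChildren() {
    lock.readLock().lock();
    try {
      List<String> copy = new ArrayList<>(childQueues);
      copy.sort(Comparator.naturalOrder());
      return copy;
    } finally {
      lock.readLock().unlock();
    }
  }

  // Analogue of reinitialize(): updates happen only under the write lock.
  public void reinitialize(List<String> newChildren) {
    lock.writeLock().lock();
    try {
      childQueues = new ArrayList<>(newChildren);
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    ParentQueueLockSketch q = new ParentQueueLockSketch();
    System.out.println(q.sortedChildren());   // [a, b, c]
    q.reinitialize(List.of("x", "y"));
    System.out.println(q.sortedChildren());   // [x, y]
  }
}
{code}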



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2019-10-14 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951552#comment-16951552
 ] 

Tao Yang commented on YARN-8737:


Thanks [~Amithsha] for the feedback. Sorry for having forgotten this issue for 
a long time.

[~cheersyang] & [~sunilg], Could you please help to review the patch?

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator raised an update for queues through the REST API; in the RM 
> the parent queue was refreshing its child queues by calling 
> ParentQueue#reinitialize while, at the same time, the async-schedule threads 
> were sorting the child queues in 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may happen 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read-lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write-lock will be held when updating child queues in 
> ParentQueue#reinitialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage

2019-10-13 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671
 ] 

Tao Yang edited comment on YARN-9838 at 10/14/19 3:17 AM:
--

Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. One last small suggestion is to add a blank line before 
the new test case, which I can directly update before committing.

I will commit this in a few days if there are no further comments from others.


was (Author: tao yang):
Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. Last small suggestion is to add a blank line before the 
new test case.

I will commit this if no further comments from others after a few days.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-10-13 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671
 ] 

Tao Yang commented on YARN-9838:


Thanks [~jiulongZhu] for updating the patch.

LGTM, +1 for the patch. One last small suggestion is to add a blank line before 
the new test case.

I will commit this in a few days if there are no further comments from others.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage

2019-10-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330
 ] 

Tao Yang edited comment on YARN-9838 at 10/11/19 10:02 AM:
---

Thanks [~jiulongZhu] for fixing this issue. 
The patch LGTM in general; some minor suggestions for the patch:
* Check-style warnings need to be fixed; after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of the updated log needs to be adjusted, and the unnecessary 
deletion of a blank line in LeafQueue should be reverted.
* The comment "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add details here.
* As for the UT, you can remove the before-fix block and just keep the correct 
verification. Moreover, I think it's better to remove the method 
comment ("//YARN-9838") since we can find the source easily via git, and the 
"/\*\* \*/" comment style is usually used for a class or method, so it's better 
to use "//" or "/\* \*/" inside the method.


was (Author: tao yang):
Thanks [~jiulongZhu] for fixing this issue. 
The patch is LGTM in general,  some minor suggestions for the patch:
* check-style warnings need to be fixed, after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of updated log need to be adjusted and useless deletion of a 
blank line should be reverted in LeafQueue.
* The annotation "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add details here.
* As for UT, you can remove before-fixed block and just keep the correct 
verification.  Moreover, I think it's better to remove "//YARN-9838" since we 
can find the source easily by git, and the annotation style "/** */" often used 
for class or method, it's better to use "//" or "/* */" in the method.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics

2019-10-11 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Issue Type: Bug  (was: Improvement)

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics

2019-10-11 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9838:
---
Fix Version/s: (was: 2.7.3)

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri

2019-10-11 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330
 ] 

Tao Yang commented on YARN-9838:


Thanks [~jiulongZhu] for fixing this issue. 
The patch LGTM in general; some minor suggestions for the patch:
* Check-style warnings need to be fixed; after that, you can run 
"dev-support/bin/test-patch /path/to/my.patch" to confirm.
* The indentation of the updated log needs to be adjusted, and the unnecessary 
deletion of a blank line in LeafQueue should be reverted.
* The comment "sync ResourceUsageByLabel ResourceUsageByUser and 
numContainer" can be removed since it seems unnecessary to add details here.
* As for the UT, you can remove the before-fix block and just keep the correct 
verification. Moreover, I think it's better to remove "//YARN-9838" since we 
can find the source easily via git, and the "/** */" comment style is usually 
used for a class or method, so it's better to use "//" or "/* */" in the method.

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.3
>
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" become positive or 
> negative values while the queue is completely idle (no RUNNING, no NEW 
> apps...). In extreme cases, apps couldn't be submitted to a queue that is 
> actually idle but whose "Used Resource" is far more than zero, just like a 
> "Container Leak".
>       Firstly, I found that "Used Resource", "Used Capacity" and "Absolute 
> Used Capacity" use the "Used" value of the ResourceUsage kept by 
> AbstractCSQueue, and "Num Container" uses the "numContainer" value kept by 
> LeafQueue; AbstractCSQueue#allocateResource and 
> AbstractCSQueue#releaseResource change the values of "numContainer" and 
> "Used". Secondly, by comparing how numContainer, ResourceUsageByLabel and 
> QueueMetrics change (#allocateContainer and #releaseContainer) for 
> applications with and without "movetoqueue", I found that moving the 
> reservedContainers didn't update the "numContainer" value in AbstractCSQueue 
> or the "used" value in ResourceUsage when the application was moved from one 
> queue to another.
>         The way the metric values change when reservedContainers are 
> allocated, moved from the $FROM queue to the $TO queue, and released is not 
> conservative: the Resource is allocated from the $FROM queue but released to 
> the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the 
> same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM 
> queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in 
> $TO queue|decrease in $TO queue|
>       By contrast, the way the metric values change when allocatedContainers 
> (allocated, acquired, running) are allocated, moved with movetoqueue, and 
> released is absolutely conservative.
>    



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672
 ] 

Tao Yang edited comment on YARN-8995 at 9/7/19 12:33 AM:
-

Thanks [~jhung] for fixing this problem; sorry for missing the changes to the 
logger class in branch-3.1 and branch-3.2. 
The failures in the Jenkins report are caused by the running environment and 
are unrelated to the patch.
Patch LGTM and already tested in my local environment. Committing shortly.


was (Author: tao yang):
Thanks [~jhung] for fixing this problem, sorry for missing changes about logger 
class in branch-3.1.
Patch LGTM and already tested in my local environment. Committing shortly.

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: TestStreamPerf.java, 
> YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to degrade the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue grows too big, to add that 
> information to the metrics, and to make the queue-size threshold a 
> configurable parameter.
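
A minimal, self-contained sketch of the proposal, not the AsyncDispatcher patch 
itself: each time the queue grows by another detailsInterval events, log the 
queue size together with a per-type breakdown, where the threshold is a 
configurable parameter. All names below are illustrative.
{code:java}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

public class EventQueueMonitorSketch {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final Map<String, LongAdder> queuedByType = new ConcurrentHashMap<>();
  private final int detailsInterval;   // configurable threshold

  EventQueueMonitorSketch(int detailsInterval) {
    this.detailsInterval = detailsInterval;
  }

  void dispatch(String eventType) {
    eventQueue.add(eventType);
    queuedByType.computeIfAbsent(eventType, t -> new LongAdder()).increment();
    int size = eventQueue.size();
    if (detailsInterval > 0 && size % detailsInterval == 0) {
      // Log which event types are piling up, not just the raw queue size.
      System.out.println("Event queue size = " + size
          + ", queued event types = " + queuedByType);
    }
  }

  public static void main(String[] args) {
    EventQueueMonitorSketch monitor = new EventQueueMonitorSketch(3);
    for (int i = 0; i < 7; i++) {
      monitor.dispatch(i % 2 == 0 ? "NODE_UPDATE" : "APP_ATTEMPT_ADDED");
    }
  }
}
{code}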



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672
 ] 

Tao Yang commented on YARN-8995:


Thanks [~jhung] for fixing this problem; sorry for missing the changes to the 
logger class in branch-3.1.
Patch LGTM and already tested in my local environment. Committing shortly.

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: TestStreamPerf.java, 
> YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to degrade the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue grows too big, to add that 
> information to the metrics, and to make the queue-size threshold a 
> configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9817) Fix failing testcases due to not initialized AsyncDispatcher - ArithmeticException: / by zero

2019-09-06 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924659#comment-16924659
 ] 

Tao Yang commented on YARN-9817:


Thanks [~Prabhu Joseph] for raising this issue. 
Patch LGTM, committing now...

> Fix failing testcases due to not initialized AsyncDispatcher -  
> ArithmeticException: / by zero
> --
>
> Key: YARN-9817
> URL: https://issues.apache.org/jira/browse/YARN-9817
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.3.0, 3.2.1, 3.1.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9817-001.patch
>
>
> The testcases below are failing because AsyncDispatcher throws 
> ArithmeticException: / by zero
> {code}
>  hadoop.mapreduce.v2.app.TestRuntimeEstimators 
>  hadoop.mapreduce.v2.app.job.impl.TestJobImpl 
>  hadoop.mapreduce.v2.app.TestMRApp 
> {code}
> Error Message:
> {code}
> [ERROR] testUpdatedNodes(org.apache.hadoop.mapreduce.v2.app.TestMRApp)  Time 
> elapsed: 0.847 s  <<< ERROR!
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1015)
>   at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:141)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1544)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1263)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:301)
>   at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:285)
>   at 
> org.apache.hadoop.mapreduce.v2.app.TestMRApp.testUpdatedNodes(TestMRApp.java:223)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> This happens when AsyncDispatcher is not initialized in the testcases and so 
> detailsInterval is taken as 0.
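
(For context, a minimal, self-contained sketch of the failure mode and the kind of 
guard that avoids it; the variable names below are illustrative, not copied from 
AsyncDispatcher.)
{code:java}
// Sketch only: Java's % operator throws ArithmeticException for a zero divisor,
// and detailsInterval stays 0 when serviceInit() never runs in these tests.
public class ZeroIntervalSketch {
  public static void main(String[] args) {
    int detailsInterval = 0;   // stands in for the uninitialized dispatcher
    int qSize = 5000;
    // Guarding on detailsInterval > 0 avoids the "/ by zero" shown above.
    if (detailsInterval > 0 && qSize % detailsInterval == 0) {
      System.out.println("Event queue size is " + qSize);
    } else {
      System.out.println("Skip detail logging until the dispatcher is initialized");
    }
  }
}
{code}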






[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-05 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923891#comment-16923891
 ] 

Tao Yang commented on YARN-9795:


+1 for the latest patch.
I will commit this if there are no further comments from others.

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch, YARN-9795.002.patch, 
> YARN-9795.003.patch, YARN-9795.004.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  






[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-05 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923882#comment-16923882
 ] 

Tao Yang commented on YARN-9795:


Thanks [~fengnanli] for the update. A small suggestion: remove the null initial 
value for aMContainerAllocationDelay since it seems redundant. Does that make 
sense?

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch, YARN-9795.002.patch, 
> YARN-9795.003.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  






[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923024#comment-16923024
 ] 

Tao Yang commented on YARN-9795:


Thanks [~fengnanli] for this improvement.
The patch almost LGTM. IMO, there's no need to set -1 as the initial value of 
scheduledTime or to add the special annotation; 0 should be the proper initial 
value, like the other times. The new check-style warnings should be fixed as well.
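
(To illustrate why 0 works as the default, here is a hedged, self-contained sketch; 
the class, field and method names are assumptions for illustration, not the actual 
QueueMetrics/ClusterMetrics code.)
{code:java}
// 0 serves as the "not scheduled yet" default, so no -1 sentinel or special
// annotation is needed.
public class AmAllocationDelaySketch {
  private long scheduledTime;   // 0 until the AM container request is scheduled

  void onAmContainerScheduled(long now) {
    if (scheduledTime == 0) {
      scheduledTime = now;
    }
  }

  long onAmContainerAllocated(long now) {
    // The delay is only meaningful once scheduledTime has been recorded.
    return scheduledTime == 0 ? 0 : now - scheduledTime;
  }
}
{code}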

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922996#comment-16922996
 ] 

Tao Yang commented on YARN-8995:


Hi, [~zhuqi], I found another place that needs to be improved: {{ if (qSize % 
detailsInterval == 0) }} should be updated to {{ if (qSize != 0 && qSize % 
detailsInterval == 0 && lastEventDetailsQueueSizeLogged != qSize) }} to avoid 
printing details for an empty queue or printing the same details redundantly.
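
(A self-contained sketch of the suggested condition; the field names follow the 
comment above, while the surrounding class and output are illustrative, not the 
actual AsyncDispatcher code.)
{code:java}
public class QueueDetailsLoggingSketch {
  private final int detailsInterval = 1000;
  private int lastEventDetailsQueueSizeLogged = -1;

  void maybePrintEventQueueDetails(int qSize) {
    // Skip the empty queue and avoid printing the same size repeatedly.
    if (qSize != 0 && qSize % detailsInterval == 0
        && lastEventDetailsQueueSizeLogged != qSize) {
      lastEventDetailsQueueSizeLogged = qSize;
      System.out.println("Event queue details, size=" + qSize);
    }
  }

  public static void main(String[] args) {
    QueueDetailsLoggingSketch sketch = new QueueDetailsLoggingSketch();
    sketch.maybePrintEventQueueDetails(0);     // empty queue: no output
    sketch.maybePrintEventQueueDetails(2000);  // prints once
    sketch.maybePrintEventQueueDetails(2000);  // same size again: no output
  }
}
{code}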

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event type when the event queue size is too big, add the information 
> to the metrics, and make the queue-size threshold a parameter which can be 
> changed.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922279#comment-16922279
 ] 

Tao Yang commented on YARN-8995:


Confirmed that the latest patch should not fail like that.
Now the patch LGTM; waiting for feedback from [~cheersyang], thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event type when the event queue size is too big, add the information 
> to the metrics, and make the queue-size threshold a parameter which can be 
> changed.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921981#comment-16921981
 ] 

Tao Yang commented on YARN-8995:


Hi, [~zhuqi]. I noticed that TestAsyncDispatcher#testPrintDispatcherEventDetails, 
which is added by this patch, failed 2 days ago. Can you confirm why this 
happened? Even though it didn't happen again, I'm still afraid it may fail 
intermittently.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event type when the event queue size is too big, add the information 
> to the metrics, and make the queue-size threshold a parameter which can be 
> changed.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-01 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920568#comment-16920568
 ] 

Tao Yang commented on YARN-8995:


Thanks [~zhuqi] for the update.
Patch LGTM, could you please also fix the remaining check-style warnings?
Hi, [~cheersyang], please help to review again. Are these changes OK with you?

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch
>
>
> In our growing cluster, there are unexpected situations that cause some event 
> queues to block the performance of the cluster, such as the bug of 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event type when the event queue size is too big, add the information 
> to the metrics, and make the queue-size threshold a parameter which can be 
> changed.






[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919658#comment-16919658
 ] 

Tao Yang commented on YARN-9540:


Thanks [~abmodi], [~adam.antal] for the review and commit.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}






[jira] [Commented] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919654#comment-16919654
 ] 

Tao Yang commented on YARN-9798:


Thanks [~abmodi] for the review. 
The frequency is only 1 or 2 failures in 2000 runs, and it didn't happen again 
after this fix.
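
(The fix described in the issue below adds {{rm.drainEvents()}} before the 
assertion. As a minimal, self-contained sketch of that drain-before-assert 
pattern, the class below uses a plain single-threaded executor in place of the 
real MockRM/AsyncDispatcher; all names are illustrative.)
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DrainBeforeAssertSketch {
  private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
  private final AtomicInteger handledEvents = new AtomicInteger();

  void handle(String event) {
    // Events are handled asynchronously, like AsyncDispatcher's event thread.
    dispatcher.submit(() -> handledEvents.incrementAndGet());
  }

  void drainEvents() throws InterruptedException {
    // Stand-in for rm.drainEvents(): wait until all queued events are processed.
    dispatcher.shutdown();
    dispatcher.awaitTermination(10, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws Exception {
    DrainBeforeAssertSketch sketch = new DrainBeforeAssertSketch();
    sketch.handle("UNREGISTERED");
    // Asserting right here could intermittently observe 0, as in the failure below.
    sketch.drainEvents();
    if (sketch.handledEvents.get() != 1) {
      throw new AssertionError("Expecting only one event");
    }
    System.out.println("Handled events: " + sketch.handledEvents.get());
  }
}
{code}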

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of 
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the 
> YARN-9714 jenkins report. The cause is that the assertion is meant to make sure 
> the dispatcher has handled the UNREGISTERED event but does not wait until all 
> events in the dispatcher are handled; we need to add {{rm.drainEvents()}} before 
> that assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> 

[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9798:
---
Attachment: (was: YARN-9798.001.patch)

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of 
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the 
> YARN-9714 jenkins report. The cause is that the assertion is meant to make sure 
> the dispatcher has handled the UNREGISTERED event but does not wait until all 
> events in the dispatcher are handled; we need to add {{rm.drainEvents()}} before 
> that assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
>   at 
> 

[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently

2019-08-30 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9798:
---
Attachment: YARN-9798.001.patch

> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails 
> intermittently
> -
>
> Key: YARN-9798
> URL: https://issues.apache.org/jira/browse/YARN-9798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9798.001.patch
>
>
> Found an intermittent failure of 
> ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in the 
> YARN-9714 jenkins report. The cause is that the assertion is meant to make sure 
> the dispatcher has handled the UNREGISTERED event but does not wait until all 
> events in the dispatcher are handled; we need to add {{rm.drainEvents()}} before 
> that assertion to fix this issue.
> Failure info:
> {noformat}
> [ERROR] 
> testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity)
>   Time elapsed: 0.559 s  <<< FAILURE!
> java.lang.AssertionError: Expecting only one event expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Standard output:
> {noformat}
> 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] 
> resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error 
> in handling event type REGISTERED for applicationAttempt 
> appattempt_1567061994047_0001_01
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276)
>   at 
> org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401)
>   at 
> 

[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919204#comment-16919204
 ] 

Tao Yang commented on YARN-9714:


Thanks [~rohithsharma], [~bibinchundatt] for the review and commit!

> ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
> -
>
> Key: YARN-9714
> URL: https://issues.apache.org/jira/browse/YARN-9714
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: memory-leak
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9714.001.patch, YARN-9714.002.patch, 
> YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch
>
>
> Recently an RM full GC happened in one of our clusters. After investigating the 
> memory dump and jstack, I found two places in RM that may cause memory leaks after 
> RM transitioned to standby:
>  # The release cache cleanup timer in AbstractYarnScheduler is never canceled.
>  # The ZooKeeper connection in ZKRMStateStore is never closed.
> To solve those leaks, we should close the connection or cancel the timer when 
> services are stopping.
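
(A minimal, self-contained sketch of the cleanup idea; the class and field names 
are illustrative, and the actual fix lives in AbstractYarnScheduler and 
ZKRMStateStore.)
{code:java}
import java.util.Timer;
import java.util.TimerTask;

public class StandbyCleanupSketch implements AutoCloseable {
  // Stands in for the release cache cleanup timer in AbstractYarnScheduler.
  private final Timer releaseCacheCleanupTimer =
      new Timer("release-cache-cleanup", true);
  // Stands in for the ZooKeeper client held by ZKRMStateStore.
  private AutoCloseable zkConnection;

  void start(AutoCloseable connection) {
    this.zkConnection = connection;
    releaseCacheCleanupTimer.schedule(new TimerTask() {
      @Override public void run() { /* periodic release cache cleanup */ }
    }, 1000L, 1000L);
  }

  @Override
  public void close() throws Exception {
    // Without these two calls, each active-to-standby transition leaks a timer
    // thread and a ZooKeeper connection, as described above.
    releaseCacheCleanupTimer.cancel();
    if (zkConnection != null) {
      zkConnection.close();
    }
  }
}
{code}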






[jira] [Resolved] (YARN-9803) NPE while accessing Scheduler UI

2019-08-29 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang resolved YARN-9803.

Resolution: Duplicate

Hi, [~yifan.stan]. This is a duplicate of YARN-9685, so I'm closing it as a duplicate.

> NPE while accessing Scheduler UI
> 
>
> Key: YARN-9803
> URL: https://issues.apache.org/jira/browse/YARN-9803
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Xie YiFan
>Assignee: Xie YiFan
>Priority: Major
> Attachments: YARN-9803-branch-3.1.1.001.patch
>
>
> The same as what is described in YARN-4624.
> Scenario:
>  ===
> If not every queue's capacity is configured for the node label (even a value of 0 
> must be configured explicitly), start the cluster and access the CapacityScheduler page.
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:342)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:513)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
>  






[jira] [Comment Edited] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097
 ] 

Tao Yang edited comment on YARN-9540 at 8/30/19 2:00 AM:
-

Hi, [~adam.antal]. 
The cause is that the assertion is meant to make sure the dispatcher has handled 
the event, but there is no wait before this assertion; we need to add 
{{rmDispatcher.await()}} before it, like other tests in TestRMAppTransitions, to fix this issue.
In my local test, about 5+ failures happened in 1000 runs. After applying 
the patch, I didn't see it again.


was (Author: tao yang):
Hi, [~adam.antal]. 
The cause is that the assertion which will make sure dispatcher have handled 
event but not wait, we need to add {{rmDispatcher.await()}} before that 
assertion like others in TestRMAppTransitions to fix this issue.
In my local test, about 5+ failures may happened in 1000 runs. After applying 
the patch, I didn't see it again.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> 

[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently

2019-08-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097
 ] 

Tao Yang commented on YARN-9540:


Hi, [~adam.antal]. 
The cause is that the assertion is meant to make sure the dispatcher has handled 
the event but does not wait; we need to add {{rmDispatcher.await()}} before that 
assertion, like other tests in TestRMAppTransitions, to fix this issue.
In my local test, about 5+ failures happened in 1000 runs. After applying 
the patch, I didn't see it again.

> TestRMAppTransitions fails intermittently
> -
>
> Key: YARN-9540
> URL: https://issues.apache.org/jira/browse/YARN-9540
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, test
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9540.001.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0]
> {code}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}


