[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2019-10-15 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951984#comment-16951984
 ] 

Weiwei Yang commented on YARN-8737:
---

Hi [~Tao Yang]

Change LGTM, could you pls submit the patch?

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API, so in the RM the 
> parent queue was refreshing its child queues by calling ParentQueue#reinitialize. 
> Meanwhile, the async-schedule threads were sorting the child queues while calling 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may occur 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock to 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, since the 
> write lock is already held while child queues are updated in 
> ParentQueue#reinitialize.
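
The proposed locking scheme can be sketched in isolation as follows. This is a minimal illustration, not the attached patch: `GuardedChildQueues` and its string-valued children are hypothetical stand-ins for ParentQueue and its child queues, and copying the list before sorting is an assumption made here to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a parent queue; not the real ParentQueue class. */
final class GuardedChildQueues {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> childQueues = new ArrayList<>();

  /** Analogue of reinitialize: replaces the children under the write lock. */
  void reinitialize(List<String> newChildren) {
    lock.writeLock().lock();
    try {
      childQueues.clear();
      childQueues.addAll(newChildren);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Analogue of the sorting method: copies and sorts under the read lock. */
  Iterator<String> sortAndGetChildrenIterator() {
    lock.readLock().lock();
    try {
      // Writers are excluded while the copy and the sort run.
      List<String> snapshot = new ArrayList<>(childQueues);
      snapshot.sort(Comparator.naturalOrder());
      return snapshot.iterator();
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    GuardedChildQueues parent = new GuardedChildQueues();
    parent.reinitialize(Arrays.asList("root.b", "root.a"));
    parent.sortAndGetChildrenIterator().forEachRemaining(System.out::println);
  }
}
```

Several scheduling threads can hold the read lock at the same time, so sorting is only blocked while a reinitialization is in progress.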



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metric

2019-10-13 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YARN-9838:
-

Assignee: jiulongzhu

> Using the CapacityScheduler,Apply "movetoqueue" on the application which CS 
> reserved containers for,will cause "Num Container" and "Used Resource" in 
> ResourceUsage metrics error 
> --
>
> Key: YARN-9838
> URL: https://issues.apache.org/jira/browse/YARN-9838
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: jiulongzhu
>Assignee: jiulongzhu
>Priority: Critical
>  Labels: patch
> Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, 
> YARN-9838.0001.patch, YARN-9838.0002.patch
>
>
>       In some of our clusters, we are seeing "Used Resource", "Used 
> Capacity", "Absolute Used Capacity" and "Num Container" show positive or 
> negative values when the queue is completely idle (no RUNNING, no NEW apps...). In 
> extreme cases, apps cannot be submitted to a queue that is actually idle 
> because its "Used Resource" is far greater than zero, much like a "Container Leak".
>       First, I found that "Used Resource", "Used Capacity" and "Absolute Used 
> Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and 
> "Num Container" uses the "numContainer" value kept by LeafQueue. 
> AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource 
> change the values of "numContainer" and "Used". Second, by comparing 
> how numContainer, ResourceUsageByLabel and QueueMetrics 
> change (#allocateContainer and #releaseContainer) for applications with 
> and without "movetoqueue", I found that moving the reservedContainers does not 
> modify the "numContainer" value in AbstractCSQueue or the "used" value in 
> ResourceUsage when the application is moved from one queue to another.
>         The table below shows how the metric values change when reservedContainers 
> are allocated, moved from the $FROM queue to the $TO queue, and released. The increases 
> and decreases do not balance: the Resource is allocated from the $FROM queue but 
> released to the $TO queue.
> ||move reservedContainer||allocate||movetoqueue||release||
> |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM queue stays the same, $TO queue stays the same{color}|decrease in $TO queue|
> |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in $TO queue|decrease in $TO queue|
>       For allocatedContainers (allocated, acquired, running), the metric changes 
> across allocate, movetoqueue and release are fully balanced.
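
The imbalance in the table can be reproduced with a toy counter model. This is only a sketch of the accounting described above, assuming one reserved container and a per-queue integer counter; `ReservedMoveAccounting` and its methods are hypothetical and do not correspond to CapacityScheduler classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a per-queue numContainer counter; names are hypothetical. */
final class ReservedMoveAccounting {
  private final Map<String, Integer> numContainers = new HashMap<>();

  void allocate(String queue) {
    numContainers.merge(queue, 1, Integer::sum);
  }

  void release(String queue) {
    numContainers.merge(queue, -1, Integer::sum);
  }

  /** Moves one container; with adjustCounters == false the counters stay untouched. */
  void move(String from, String to, boolean adjustCounters) {
    if (adjustCounters) {
      release(from);
      allocate(to);
    }
  }

  int count(String queue) {
    return numContainers.getOrDefault(queue, 0);
  }

  /** Runs allocate, movetoqueue, release and returns {FROM count, TO count}. */
  static int[] run(boolean adjustOnMove) {
    ReservedMoveAccounting accounting = new ReservedMoveAccounting();
    accounting.allocate("FROM");
    accounting.move("FROM", "TO", adjustOnMove);
    accounting.release("TO");
    return new int[] {accounting.count("FROM"), accounting.count("TO")};
  }

  public static void main(String[] args) {
    int[] reported = run(false);
    int[] balanced = run(true);
    System.out.println("move without adjustment: FROM=" + reported[0] + " TO=" + reported[1]);
    System.out.println("move with adjustment:    FROM=" + balanced[0] + " TO=" + balanced[1]);
  }
}
```

Without the adjustment on move, the $FROM queue is left with a positive count and the $TO queue with a negative one, which matches the positive and negative values reported for idle queues.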
>    






[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties

2019-09-17 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931952#comment-16931952
 ] 

Weiwei Yang commented on YARN-2255:
---

I have committed this to trunk, 3.2 and 3.1. 3.0 is skipped because it is EOL 
according to 
[https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches].
 

> YARN Audit logging not added to log4j.properties
> 
>
> Key: YARN-2255
> URL: https://issues.apache.org/jira/browse/YARN-2255
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Varun Saxena
>Assignee: Aihua Xu
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-2255.1.patch, YARN-2255.patch
>
>
> The log4j.properties file, which is part of the hadoop package, doesn't have YARN 
> Audit logging tied to it. This leads to audit logs being generated in 
> normal log files. Audit logs should be generated in a separate log file.






[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties

2019-09-17 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931947#comment-16931947
 ] 

Weiwei Yang commented on YARN-2255:
---

Sorry for the late response [~aihuaxu], this slipped off my list.

I am committing this now.

> YARN Audit logging not added to log4j.properties
> 
>
> Key: YARN-2255
> URL: https://issues.apache.org/jira/browse/YARN-2255
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Varun Saxena
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-2255.1.patch, YARN-2255.patch
>
>
> The log4j.properties file, which is part of the hadoop package, doesn't have YARN 
> Audit logging tied to it. This leads to audit logs being generated in 
> normal log files. Audit logs should be generated in a separate log file.






[jira] [Commented] (YARN-2255) YARN Audit logging not added to log4j.properties

2019-09-10 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927261#comment-16927261
 ] 

Weiwei Yang commented on YARN-2255:
---

Thanks [~aihuaxu]. Patch LGTM.

Thanks for putting it back on the table, let's get this fixed.

I haven't tried this myself; I assume you have verified that it works well in your 
env, correct?

> YARN Audit logging not added to log4j.properties
> 
>
> Key: YARN-2255
> URL: https://issues.apache.org/jira/browse/YARN-2255
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Varun Saxena
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-2255.1.patch, YARN-2255.patch
>
>
> The log4j.properties file, which is part of the hadoop package, doesn't have YARN 
> Audit logging tied to it. This leads to audit logs being generated in 
> normal log files. Audit logs should be generated in a separate log file.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922309#comment-16922309
 ] 

Weiwei Yang commented on YARN-8995:
---

Also looks good to me, [~Tao Yang], feel free to commit this.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to degrade the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, and to add that information 
> to the metrics. The queue size threshold should be a parameter that can be 
> changed.
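
One plausible reading of this request is sketched below: whenever the queue size reaches a non-zero multiple of a configurable threshold, count the queued events by type so the counts can be logged or exported. This is an assumption-laden illustration, not the attached patch; `EventQueueSizeLogger`, `maybeSummarize` and the string event types are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper that summarizes queued event types at a size threshold. */
final class EventQueueSizeLogger {
  private final int threshold;

  EventQueueSizeLogger(int threshold) {
    if (threshold <= 0) {
      throw new IllegalArgumentException("threshold must be positive");
    }
    this.threshold = threshold;
  }

  /**
   * Returns per-type counts when the queue size is a non-zero multiple of the
   * threshold, and an empty map otherwise.
   */
  Map<String, Long> maybeSummarize(Collection<String> eventTypesInQueue) {
    int size = eventTypesInQueue.size();
    if (size == 0 || size % threshold != 0) {
      return Collections.emptyMap();
    }
    return eventTypesInQueue.stream()
        .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    EventQueueSizeLogger logger = new EventQueueSizeLogger(3);
    System.out.println(logger.maybeSummarize(Arrays.asList("NODE_UPDATE", "APP_ADDED", "NODE_UPDATE")));
  }
}
```

Keeping the threshold as a constructor argument mirrors the request that it be a changeable parameter.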






[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-08-29 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918495#comment-16918495
 ] 

Weiwei Yang commented on YARN-9538:
---

Hi [~Tao Yang]

Can you please create MD files based on the Google doc? Once you've done that, 
you can generate HTML files locally for preview.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.






[jira] [Updated] (YARN-9050) [Umbrella] Usability improvements for scheduler activities

2019-08-29 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9050:
--
Fix Version/s: 3.3.0

> [Umbrella] Usability improvements for scheduler activities
> --
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities based on 
> YARN 3.1 in our cluster, as follows:
>  1. Not available for multi-thread asynchronous scheduling. App and node 
> activities may be mixed up when multiple scheduling threads record activities 
> of different allocation processes in the same variables, such as appsAllocation 
> and recordingNodesAllocation in ActivitiesManager. I think these variables 
> should be thread-local so that activities stay separate across threads.
>  2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger skips recording through {{if (node == null || 
> activitiesManager == null)}} when node is null, which indicates that the 
> allocation is for multiple nodes. We need to support recording activities for 
> the multi-node lookup mechanism.
>  3. Current app activities cannot meet the requirements of diagnostics. For 
> example, we can see that a node doesn't match a request but it is hard to know why; 
> especially when using placement constraints, it is difficult to make a 
> detailed diagnosis manually. So I propose to improve the diagnostics of 
> activities: add diagnostics for the placement constraints check, update the 
> insufficient resource diagnostic with detailed info (like 'insufficient 
> resource names:[memory-mb]'), and so on.
>  4. Add more useful fields for app activities. In some scenarios we need to 
> distinguish different requests but can't locate requests based on app 
> activities info; some other fields, such as allocation tags, can help to filter 
> what we want. We have added containerPriority, allocationRequestId 
> and allocationTags fields in AppAllocation.
>  5. Filter app activities by key fields. Sometimes the results of app 
> activities are massive and it's hard to find what we want. We support filtering 
> by allocation-tags to meet requirements from some apps; moreover, we can 
> take container-priority and allocation-request-id as candidates if necessary.
>  6. Aggregate app activities by diagnostics. For a single allocation process, 
> activities can still be massive in a large cluster. We frequently want to 
> know why a request can't be allocated in the cluster, and it's hard to check every node 
> manually in a large cluster, so aggregating app activities by 
> diagnostics is necessary. We have added a groupingType 
> parameter to the app-activities REST API for this, which supports grouping by 
> diagnostics.
> I think we can discuss these points; useful improvements that 
> are accepted will be added to the patch. Thanks.
> The running design doc is attached 
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].
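
Point 1 above, keeping the recording variables thread-local, can be sketched as follows. This is a minimal illustration under assumptions: `ThreadLocalActivities` and its methods are hypothetical, and the recorded activities are plain strings rather than the ActivitiesManager data structures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-thread activity recorder; not the real ActivitiesManager. */
final class ThreadLocalActivities {
  // Each scheduling thread gets its own list, so records never interleave.
  private static final ThreadLocal<List<String>> RECORDED =
      ThreadLocal.withInitial(ArrayList::new);

  static void record(String activity) {
    RECORDED.get().add(activity);
  }

  /** Returns and clears the activities recorded by the calling thread. */
  static List<String> drain() {
    List<String> out = new ArrayList<>(RECORDED.get());
    RECORDED.get().clear();
    return out;
  }

  /** Records the given activities on a separate thread and returns what that thread saw. */
  static List<String> recordOnNewThread(String... activities) {
    List<String> seen = new ArrayList<>();
    Thread worker = new Thread(() -> {
      for (String activity : activities) {
        record(activity);
      }
      seen.addAll(drain());
    });
    worker.start();
    try {
      worker.join(); // join makes the worker's writes to 'seen' visible here
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return seen;
  }

  public static void main(String[] args) {
    record("main: node skipped");
    System.out.println(recordOnNewThread("worker: container allocated"));
    System.out.println(drain());
  }
}
```

Each thread only ever sees its own records, which is the separation the description asks for.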






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918472#comment-16918472
 ] 

Weiwei Yang commented on YARN-9664:
---

+1 committing now.

Thanks [~Tao Yang]

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently, some diagnostics are not easy enough for ordinary 
> users to understand, and I found some places that still need improvement, such as 
> missing partition information and a lack of necessary activities. This issue is to 
> address these shortcomings.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-29 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918421#comment-16918421
 ] 

Weiwei Yang commented on YARN-9664:
---

The UT failure does not seem related to this patch, [~Tao Yang], could you please confirm?

Other than that, I am +1 to v3 patch.

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch, 
> YARN-9664.003.patch
>
>
> Currently, some diagnostics are not easy enough for ordinary 
> users to understand, and I found some places that still need improvement, such as 
> missing partition information and a lack of necessary activities. This issue is to 
> address these shortcomings.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-28 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918260#comment-16918260
 ] 

Weiwei Yang commented on YARN-9664:
---

Thanks [~Tao Yang].

For the single placement node, that is confusing, I think we can just remove it.

"Initial check: node has been removed from scheduler"

"Initial check: node resource is insufficient for minimum allocation"

"Queue skipped because node has been reserved"

"Queue skipped because node resource is insufficient"

and for the locality,

"Node skipped because of no off-switch and locality violation" -> "Node skipped 
because node/rack locality cannot be satisfied"

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently, some diagnostics are not easy enough for ordinary 
> users to understand, and I found some places that still need improvement, such as 
> missing partition information and a lack of necessary activities. This issue is to 
> address these shortcomings.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-28 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917798#comment-16917798
 ] 

Weiwei Yang commented on YARN-9664:
---

Hello [~Tao Yang]

I just went through the patch. Most of the changes stay in the activities package, 
which is good.

*ActivitiesTestUtils*
{code:java}
getFirstSubNodeFromJson(JSONObject json, String... hierachicalFieldNames)
{code}
typo: {{hierachicalFieldNames}} -> {{hierarchicalFieldNames}}

*ActivitiesUtils*

Line 56: I noticed that the 1st filter is there to filter out null objects; can we do 
it like the following?
{code:java}
activityNodes.stream().filter(e -> Objects.nonNull(e))...
{code}
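
For reference, a self-contained version of this kind of null filter might look like the sketch below. It uses the method reference `Objects::nonNull`, which is equivalent to the lambda above; the class name, the method name and the string element type are hypothetical and stand in for the actual activity node type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

/** Hypothetical example of dropping null elements from a stream. */
final class NonNullFilterExample {
  static List<String> dropNulls(List<String> activityNodes) {
    return activityNodes.stream()
        .filter(Objects::nonNull) // same effect as e -> Objects.nonNull(e)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(dropNulls(Arrays.asList("node-a", null, "node-b")));
  }
}
```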
*ActivityDiagnosticConstant*
{code:java}
 public final static String INIT_CHECK_SINGLE_NODE_REMOVED = "Initial check: "
  + "single placement node has been removed from scheduler";
{code}
what does "single placement node" mean here?
{code:java}
 public final static String QUEUE_SKIPPED_TO_RESPECT_FIFO = "Queue skipped "
  + "following applications in the queue to respect FIFO of applications";
{code}
I think we can remove "in the queue", that seems redundant.
{code:java}
  public final static String
  NODE_SKIPPED_BECAUSE_OF_NO_OFF_SWITCH_AND_LOCALITY_VIOLATION =
  "Node skipped because of no off-switch and locality violation";
{code}
I am also not quite sure what this means, can you please elaborate?

*ParentQueue*

line 650: is the check "if (node != null && !isReserved)" safe here?

Thanks

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently, some diagnostics are not easy enough for ordinary 
> users to understand, and I found some places that still need improvement, such as 
> missing partition information and a lack of necessary activities. This issue is to 
> address these shortcomings.






[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding

2019-08-27 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917421#comment-16917421
 ] 

Weiwei Yang commented on YARN-9664:
---

Just got time to take a look at this.
Wow, this is a huge set of changes. [~Tao Yang], have you verified that the output 
of these changes is as expected?
I will try to go through the changes and hopefully contribute enough 
review comments.
Thx

> Improve response of scheduler/app activities for better understanding
> -
>
> Key: YARN-9664
> URL: https://issues.apache.org/jira/browse/YARN-9664
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9664.001.patch, YARN-9664.002.patch
>
>
> Currently, some diagnostics are not easy enough for ordinary 
> users to understand, and I found some places that still need improvement, such as 
> missing partition information and a lack of necessary activities. This issue is to 
> address these shortcomings.






[jira] [Comment Edited] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-20 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501
 ] 

Weiwei Yang edited comment on YARN-8995 at 8/20/19 4:08 PM:


Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. Patch LGTM, I might be just a little picky on the 
configuration name, right now it is not straightforward to me.

"The interval of queue size (in thousands) for printing the boom queue event 
type details."

How about something like the following for the description, if I understand 
this correctly:

"The threshold used to trigger the logging of event types and counts in RM's 
main event dispatcher. Default length is 5000, which means RM will print events 
info when the queue size cumulatively reaches 5000 every time.  Such info can 
be used to reveal what kind of events that RM is stuck at processing mostly, it 
can help to narrow down certain performance issues."

And also, the config name is better to be something like 
{{yarn.dispatcher.print-events-info.threshold}}, you don't need to use 
in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks


was (Author: cheersyang):
Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. Patch LGTM, I might be just a little picky on the 
configuration name, right now it is not straightforward to me.
{noformat}
The interval of queue size (in thousands) for printing the boom queue event 
type details.
{noformat}
How about something like the following for the description, if I understand 
this correctly:
{noformat}
The threshold used to trigger the logging of event types and counts in RM's 
main event dispatcher. Default length is 5000, which means RM will print events 
info when the queue size cumulatively reaches 5000 every time.  Such info can 
be used to reveal what kind of events that RM is stuck at processing mostly, it 
can help to narrow down certain performance issues.
{noformat}
And also, the config name is better to be something like 
{{yarn.dispatcher.print-events-info.threshold}}, you don't need to use 
in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to degrade the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, and to add that information 
> to the metrics. The queue size threshold should be a parameter that can be 
> changed.






[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-20 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911501#comment-16911501
 ] 

Weiwei Yang commented on YARN-8995:
---

Hi [~zhuqi]/[~Tao Yang]

Thanks for working on this. Patch LGTM, I might be just a little picky on the 
configuration name, right now it is not straightforward to me.
{noformat}
The interval of queue size (in thousands) for printing the boom queue event 
type details.
{noformat}
How about something like the following for the description, if I understand 
this correctly:
{noformat}
The threshold used to trigger the logging of event types and counts in RM's 
main event dispatcher. Default length is 5000, which means RM will print events 
info when the queue size cumulatively reaches 5000 every time.  Such info can 
be used to reveal what kind of events that RM is stuck at processing mostly, it 
can help to narrow down certain performance issues.
{noformat}
And also, the config name is better to be something like 
{{yarn.dispatcher.print-events-info.threshold}}, you don't need to use 
in-thousands here, as several thousand is still human-readable.

Does that make sense?

Thanks

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, there are unexpected situations in which some event 
> queues grow large enough to degrade the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to 
> log the event types when the event queue size becomes too big, and to add that information 
> to the metrics. The queue size threshold should be a parameter that can be 
> changed.






[jira] [Commented] (YARN-2599) Standby RM should also expose some jmx and metrics

2019-08-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909036#comment-16909036
 ] 

Weiwei Yang commented on YARN-2599:
---

LGTM, that minor check-style issue can be fixed during commit. +1.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1, 2.7.3, 3.0.0-alpha1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-2599.002.patch, YARN-2599.patch
>
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 






[jira] [Commented] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead

2019-08-09 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903612#comment-16903612
 ] 

Weiwei Yang commented on YARN-9733:
---

Sure, added you as a contributor, assigned this issue to you. Thx.

> Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when 
> subprocess of container dead
> 
>
> Key: YARN-9733
> URL: https://issues.apache.org/jira/browse/YARN-9733
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qian han
>Assignee: qian han
>Priority: Major
>
> The method getTotalProcessJiffies only counts jiffies for running processes, 
> not dead processes.
> For example, take process pid100 and its children pid200 and pid300.
> When we call getCpuUsagePercent the first time, assume that pid100 has 1000 
> jiffies, pid200 2000 and pid300 3000, so totalProcessJiffies1 is 6000.
> Then we kill pid300 and call getCpuUsagePercent a second time; assume 
> that pid100 now has 1100 jiffies and pid200 2200, so totalProcessJiffies2 is 3300.
> Because the total decreased, we get a CPU usage percent of 0.
> I would like to fix this bug.
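
The arithmetic in the example above can be reproduced with a small sketch. The class and method names here are hypothetical, and the "survivor" delta is only one possible correction, shown as an assumption rather than as what the eventual patch does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not from ProcfsBasedProcessTree.
class JiffiesDeltaSketch {
  // Sums the jiffies of every process present in one sample.
  static long total(Map<String, Long> sample) {
    long sum = 0;
    for (long jiffies : sample.values()) {
      sum += jiffies;
    }
    return sum;
  }

  // Naive delta: difference of totals, which goes negative when a
  // process with accumulated jiffies disappears between samples.
  static long naiveDelta(Map<String, Long> first, Map<String, Long> second) {
    return total(second) - total(first);
  }

  // Assumed alternative: only compare processes seen in both samples.
  static long survivorDelta(Map<String, Long> first, Map<String, Long> second) {
    long delta = 0;
    for (Map.Entry<String, Long> e : second.entrySet()) {
      Long before = first.get(e.getKey());
      if (before != null) {
        delta += e.getValue() - before;
      }
    }
    return delta;
  }

  public static void main(String[] args) {
    Map<String, Long> first = new LinkedHashMap<>();
    first.put("pid100", 1000L);
    first.put("pid200", 2000L);
    first.put("pid300", 3000L);
    Map<String, Long> second = new LinkedHashMap<>();
    second.put("pid100", 1100L);
    second.put("pid200", 2200L);
    System.out.println(naiveDelta(first, second));    // 3300 - 6000
    System.out.println(survivorDelta(first, second)); // 100 + 200
  }
}
```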






[jira] [Assigned] (YARN-9733) Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when subprocess of container dead

2019-08-09 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YARN-9733:
-

Assignee: qian han

> Method getCpuUsagePercent in Class ProcfsBasedProcessTree return 0 when 
> subprocess of container dead
> 
>
> Key: YARN-9733
> URL: https://issues.apache.org/jira/browse/YARN-9733
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qian han
>Assignee: qian han
>Priority: Major
>
> The method getTotalProcessJiffies only counts jiffies for running processes, 
> not dead processes.
> For example, take process pid100 and its children pid200 and pid300.
> When we call getCpuUsagePercent the first time, assume that pid100 has 1000 
> jiffies, pid200 2000 and pid300 3000, so totalProcessJiffies1 is 6000.
> Then we kill pid300 and call getCpuUsagePercent a second time; assume 
> that pid100 now has 1100 jiffies and pid200 2200, so totalProcessJiffies2 is 3300.
> Because the total decreased, we get a CPU usage percent of 0.
> I would like to fix this bug.






[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-26 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893856#comment-16893856
 ] 

Weiwei Yang commented on YARN-7621:
---

[~cane], could you pls help to review [~Tao Yang]'s patch? Just want to cross 
check.

Thanks

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently the queue definition in ApplicationSubmissionContext differs 
> between CapacityScheduler and FairScheduler: FairScheduler needs a queue 
> path but CapacityScheduler needs a queue name. The queue definition for 
> CapacityScheduler is correct, because it does not allow duplicate leaf queue 
> names, but the difference makes it hard to switch between FairScheduler and 
> CapacityScheduler. I propose to support submitting apps with a queue path for 
> CapacityScheduler to make the interface clearer and scheduler switching 
> smoother.
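
As a rough illustration of accepting either form, a submission layer could reduce a dot-separated queue path to its leaf name. The `leafName` helper below is hypothetical and assumes paths of the form `root.parent.leaf`; the attached patches may well resolve and validate paths differently.

```java
// Hypothetical sketch; not the approach taken by the YARN-7621 patches.
class QueuePathSketch {
  // Returns the last dot-separated component, so both a full path
  // ("root.a.b") and a bare leaf name ("b") map to the same leaf.
  static String leafName(String queue) {
    int idx = queue.lastIndexOf('.');
    return idx < 0 ? queue : queue.substring(idx + 1);
  }

  public static void main(String[] args) {
    System.out.println(leafName("root.a.b"));
    System.out.println(leafName("b"));
  }
}
```

Because leaf queue names are unique in CapacityScheduler, the leaf alone identifies the queue. A real implementation would presumably also check that the given parent path matches the configured hierarchy.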






[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7621:
--
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-9698

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently the queue definition in ApplicationSubmissionContext differs 
> between CapacityScheduler and FairScheduler: FairScheduler needs a queue 
> path but CapacityScheduler needs a queue name. The queue definition for 
> CapacityScheduler is correct, because it does not allow duplicate leaf queue 
> names, but the difference makes it hard to switch between FairScheduler and 
> CapacityScheduler. I propose to support submitting apps with a queue path for 
> CapacityScheduler to make the interface clearer and scheduler switching 
> smoother.






[jira] [Updated] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler

2019-07-25 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9698:
--
Labels: fs2cs  (was: )

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> 
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Priority: Major
>  Labels: fs2cs
>
> We see some users who want to migrate from Fair Scheduler to Capacity Scheduler. 
> This Jira is created as an umbrella to track all related efforts for the 
> migration. The scope includes:
>  * Bug fixes
>  * Adding missing features
>  * Migration tools that help generate CS configs based on FS, validate 
> configs, etc.
>  * Documentation
> This is part of the CS component; the purpose is to make the migration process 
> smooth.






[jira] [Created] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler

2019-07-25 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-9698:
-

 Summary: [Umbrella] Tools to help migration from Fair Scheduler to 
Capacity Scheduler
 Key: YARN-9698
 URL: https://issues.apache.org/jira/browse/YARN-9698
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Reporter: Weiwei Yang


We see some users who want to migrate from Fair Scheduler to Capacity Scheduler. 
This Jira is created as an umbrella to track all related efforts for the 
migration. The scope includes:
 * Bug fixes
 * Adding missing features
 * Migration tools that help generate CS configs based on FS, validate 
configs, etc.
 * Documentation

This is part of the CS component; the purpose is to make the migration process 
smooth.






[jira] [Commented] (YARN-9687) Queue headroom check may let unacceptable allocation off when using DominantResourceCalculator

2019-07-22 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890170#comment-16890170
 ] 

Weiwei Yang commented on YARN-9687:
---

This is related to the core resource calculator, so it would be good to have 
[~sunilg] take a look too. [~sunilg], could you please help review this? 
Thx

> Queue headroom check may let unacceptable allocation off when using 
> DominantResourceCalculator
> --
>
> Key: YARN-9687
> URL: https://issues.apache.org/jira/browse/YARN-9687
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9687.001.patch
>
>
> Currently the queue headroom check in {{RegularContainerAllocator#checkHeadroom}} 
> uses {{Resources#greaterThanOrEqual}}, which internally compares resources 
> by ratio. When DominantResourceCalculator is used, this may let unacceptable 
> allocations through in some scenarios.
> For example:
> cluster-resource=<10GB, 10 vcores>
> queue-headroom=<2GB, 4 vcores>
> required-resource=<3GB, 1 vcores>
> Here the headroom ratio (0.4) is greater than the required ratio (0.3), so 
> such allocations pass the check during scheduling but are always 
> rejected when the proposals are committed.
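
The numbers in the example above can be checked with a simplified sketch. It does not use the Hadoop `Resources` or `DominantResourceCalculator` classes. `dominantShare`, `ratioCheck`, and `fitsIn` are hypothetical stand-ins that model a resource as a `{memoryGB, vcores}` array.

```java
// Hypothetical sketch; a simplified model, not the Hadoop implementation.
class HeadroomCheckSketch {
  // Largest per-dimension share of the cluster, i.e. the dominant share.
  static double dominantShare(long[] resource, long[] cluster) {
    double share = 0.0;
    for (int i = 0; i < resource.length; i++) {
      share = Math.max(share, (double) resource[i] / cluster[i]);
    }
    return share;
  }

  // Ratio-based comparison: passes when the headroom's dominant share
  // is at least the required resource's dominant share.
  static boolean ratioCheck(long[] headroom, long[] required, long[] cluster) {
    return dominantShare(headroom, cluster) >= dominantShare(required, cluster);
  }

  // Component-wise check: every dimension of required must fit.
  static boolean fitsIn(long[] required, long[] headroom) {
    for (int i = 0; i < required.length; i++) {
      if (required[i] > headroom[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long[] cluster = {10, 10};  // <10GB, 10 vcores>
    long[] headroom = {2, 4};   // <2GB, 4 vcores>
    long[] required = {3, 1};   // <3GB, 1 vcores>
    System.out.println(ratioCheck(headroom, required, cluster));
    System.out.println(fitsIn(required, headroom));
  }
}
```

With these inputs the ratio check compares 0.4 against 0.3 and passes, while the component-wise check fails because 3GB exceeds the 2GB of memory headroom. That is the mismatch the report describes.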






[jira] [Commented] (YARN-9682) Wrong log message when finalizing the upgrade

2019-07-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886647#comment-16886647
 ] 

Weiwei Yang commented on YARN-9682:
---

Pushed to trunk, cherry-picked to branch-3.2 and branch-3.1. Thanks for the 
contribution [~kyungwan nam].

> Wrong log message when finalizing the upgrade
> -
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following incorrect log message when running finalize-upgrade 
> for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}
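
The literal `{}` in the logged line suggests a placeholder that was never given an argument. The sketch below mimics that behaviour with a hypothetical `format` helper. It is not the SLF4J implementation and not the actual ServiceClient code, only an illustration of how the symptom can arise.

```java
// Hypothetical sketch; a toy formatter, not SLF4J or ServiceClient.
class PlaceholderSketch {
  // Replaces each "{}" with the next argument; once the arguments run
  // out, the rest of the pattern (including "{}") is copied verbatim.
  static String format(String pattern, Object... args) {
    StringBuilder out = new StringBuilder();
    int from = 0;
    int argIndex = 0;
    while (true) {
      int at = pattern.indexOf("{}", from);
      if (at < 0 || argIndex >= args.length) {
        out.append(pattern.substring(from));
        break;
      }
      out.append(pattern, from, at).append(args[argIndex++]);
      from = at + 2;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Missing argument: the placeholder is printed as-is.
    System.out.println(format("Finalize service {} upgrade"));
    // With the (assumed) service name supplied.
    System.out.println(format("Finalize service {} upgrade", "my-service"));
  }
}
```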






[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade

2019-07-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9682:
--
Fix Version/s: 3.1.3

> Wrong log message when finalizing the upgrade
> -
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following incorrect log message when running finalize-upgrade 
> for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}






[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade

2019-07-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9682:
--
Fix Version/s: 3.2.1

> Wrong log message when finalizing the upgrade
> -
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following incorrect log message when running finalize-upgrade 
> for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}






[jira] [Updated] (YARN-9682) Wrong log message when finalizing the upgrade

2019-07-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9682:
--
Summary: Wrong log message when finalizing the upgrade  (was: wrong log 
message when finalize upgrade)

> Wrong log message when finalizing the upgrade
> -
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following incorrect log message when running finalize-upgrade 
> for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}






[jira] [Commented] (YARN-9682) wrong log message when finalize upgrade

2019-07-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886630#comment-16886630
 ] 

Weiwei Yang commented on YARN-9682:
---

+1, committing shortly.

> wrong log message when finalize upgrade
> ---
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Trivial
> Attachments: YARN-9682.001.patch
>
>
> I've seen the following incorrect log message when running finalize-upgrade 
> for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO  client.ServiceClient 
> (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} 
> upgrade{code}






[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-07-10 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881876#comment-16881876
 ] 

Weiwei Yang commented on YARN-9538:
---

Hi [~Tao Yang]

I just read the v3 doc, it looks pretty good now.

I have added my comments in the doc directly, please take a look. 

Thanks

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.






[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-02 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877465#comment-16877465
 ] 

Weiwei Yang commented on YARN-7621:
---

cc [~leftnoteasy], [~sunilg], [~wilfreds].

This issue is important for users who want to migrate FS to CS.

Adding a label to tag it.

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently the queue definition in ApplicationSubmissionContext differs 
> between CapacityScheduler and FairScheduler: FairScheduler needs a queue 
> path but CapacityScheduler needs a queue name. The queue definition for 
> CapacityScheduler is correct, because it does not allow duplicate leaf queue 
> names, but the difference makes it hard to switch between FairScheduler and 
> CapacityScheduler. I propose to support submitting apps with a queue path for 
> CapacityScheduler to make the interface clearer and scheduler switching 
> smoother.






[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-02 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7621:
--
Priority: Major  (was: Minor)

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently the queue definition in ApplicationSubmissionContext differs 
> between CapacityScheduler and FairScheduler: FairScheduler needs a queue 
> path but CapacityScheduler needs a queue name. The queue definition for 
> CapacityScheduler is correct, because it does not allow duplicate leaf queue 
> names, but the difference makes it hard to switch between FairScheduler and 
> CapacityScheduler. I propose to support submitting apps with a queue path for 
> CapacityScheduler to make the interface clearer and scheduler switching 
> smoother.






[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-02 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7621:
--
Labels: fs2cs  (was: )

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently the queue definition in ApplicationSubmissionContext differs 
> between CapacityScheduler and FairScheduler: FairScheduler needs a queue 
> path but CapacityScheduler needs a queue name. The queue definition for 
> CapacityScheduler is correct, because it does not allow duplicate leaf queue 
> names, but the difference makes it hard to switch between FairScheduler and 
> CapacityScheduler. I propose to support submitting apps with a queue path for 
> CapacityScheduler to make the interface clearer and scheduler switching 
> smoother.






[jira] [Updated] (YARN-9658) Fix UT failures in TestLeafQueue

2019-07-02 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9658:
--
Summary: Fix UT failures in TestLeafQueue  (was: UT failures in 
TestLeafQueue)

> Fix UT failures in TestLeafQueue
> 
>
> Key: YARN-9658
> URL: https://issues.apache.org/jira/browse/YARN-9658
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9658.001.patch
>
>
> In ActivitiesManager, if there's no YARN configuration in the mock RMContext, 
> the cleanup interval can't be initialized to its default of 5 seconds, so the 
> cleanup thread keeps running repeatedly without any interval. This can cause 
> problems for the Mockito framework; in this case it caused an OOM, because 
> many throwable objects were generated internally by the incomplete mock.
> Add configuration for the mock RMContext to fix the failures in TestLeafQueue.






[jira] [Commented] (YARN-9658) UT failures in TestLeafQueue

2019-07-02 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877464#comment-16877464
 ] 

Weiwei Yang commented on YARN-9658:
---

+1

> UT failures in TestLeafQueue
> 
>
> Key: YARN-9658
> URL: https://issues.apache.org/jira/browse/YARN-9658
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9658.001.patch
>
>
> In ActivitiesManager, if there's no YARN configuration in the mock RMContext, 
> the cleanup interval can't be initialized to its default of 5 seconds, so the 
> cleanup thread keeps running repeatedly without any interval. This can cause 
> problems for the Mockito framework; in this case it caused an OOM, because 
> many throwable objects were generated internally by the incomplete mock.
> Add configuration for the mock RMContext to fix the failures in TestLeafQueue.






[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-02 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9655:
--
Fix Version/s: 2.9.3

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: YARN-9655.branch-2.9.patch, YARN-9655.branch-3.0.patch
>
>
> In YARN Federation mode using FederationInterceptor, when an application is 
> submitted, the AM reports an error.
> {code:java}
> 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that applicationPriority is lost.






[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-02 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9655:
--
Fix Version/s: 3.0.4

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9655.branch-2.9.patch, YARN-9655.branch-3.0.patch
>
>
> In YARN Federation mode using FederationInterceptor, when an application is 
> submitted, the AM reports an error.
> {code:java}
> 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that applicationPriority is lost.






[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876632#comment-16876632
 ] 

Weiwei Yang commented on YARN-9655:
---

Thanks [~hunhun], re-opened the issue to trigger jenkins job.

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9655.branch-2.9.patch
>
>
> In YARN Federation mode using FederationInterceptor, when an application is 
> submitted, the AM reports an error.
> {code:java}
> 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that applicationPriority is lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YARN-9655:
---

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9655.branch-2.9.patch
>
>
> In YARN Federation mode using FederationInterceptor, when an application is 
> submitted, the AM reports an error.
> {code:java}
> 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that applicationPriority is lost.






[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876620#comment-16876620
 ] 

Weiwei Yang commented on YARN-9655:
---

I just pushed this to trunk, cherry-picked to branch-3.2, branch-3.1. Thanks 
for the contribution [~hunhun].

FederationInterceptor was added in 2.9. Does this issue also exist in 
branch-2.9 and branch-3.0? If it does, then we need to provide a patch for 
branch-2.9, branch-2 and branch-3.0. [~hunhun], please let me know, thanks.

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>
> In YARN Federation mode using FederationInterceptor, when an application is 
> submitted, the AM reports an error.
> {code:java}
> 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> java.lang.NullPointerException at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The reason is that applicationPriority is lost.
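
To illustrate the failure mode described above, here is a minimal sketch of a response merge that keeps the priority field. It does not use the YARN classes and is not the committed patch; the names AllocateResponseMergeSketch, Response and merge are hypothetical stand-ins chosen for this example.

```java
class AllocateResponseMergeSketch {

  /** Minimal stand-in for an allocate response; NOT the YARN AllocateResponse class. */
  static class Response {
    final Integer applicationPriority; // null models the "lost" field
    final int allocatedContainers;

    Response(Integer applicationPriority, int allocatedContainers) {
      this.applicationPriority = applicationPriority;
      this.allocatedContainers = allocatedContainers;
    }
  }

  /** Merges a secondary response into the home response and carries the priority over. */
  static Response merge(Response home, Response secondary) {
    // Prefer the home value; fall back to the secondary one if the home value is absent.
    Integer priority = home.applicationPriority != null
        ? home.applicationPriority
        : secondary.applicationPriority;
    return new Response(priority,
        home.allocatedContainers + secondary.allocatedContainers);
  }

  public static void main(String[] args) {
    Response merged = merge(new Response(5, 1), new Response(null, 2));
    System.out.println(merged.applicationPriority + " " + merged.allocatedContainers);
  }
}
```

If the merged response were built without copying the priority, a caller that dereferences it unconditionally (as the stack trace suggests handleJobPriorityChange does) would hit the NPE.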






[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9655:
--
Fix Version/s: 3.1.3

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>



[jira] [Updated] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9655:
--
Fix Version/s: 3.2.1

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>



[jira] [Resolved] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-07-01 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YARN-9655.
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.3.0

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Fix For: 3.3.0
>
>



[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-07-01 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875986#comment-16875986
 ] 

Weiwei Yang commented on YARN-9623:
---

Hi [~Tao Yang], please create a new issue to fix this failure. Thanks.

> Auto adjust max queue length of app activities to make sure activities on all 
> nodes can be covered
> --
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9623.001.patch, YARN-9623.002.patch
>
>
> Currently we can use the configuration entry 
> "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to 
> control the max queue length of app activities, but in some scenarios this 
> value may need to be updated as the cluster grows. Moreover, it is 
> better if users can ignore this setting, therefore it should be adjusted 
> automatically.
>  The required length differs among scheduling modes:
>  * multi-node placement disabled
>  ** Heartbeat-driven scheduling: the max queue length of app activities should 
> not be less than the number of nodes. Since node heartbeats do not always 
> arrive in order, we should leave some room for misordering; for example, we 
> can guarantee that the max queue length is not less than 1.2 * numNodes.
>  ** Async scheduling: every async scheduling thread goes through all nodes in 
> order, so in this mode we should guarantee that the max queue length is at 
> least numThreads * numNodes.
>  * multi-node placement enabled: activities on all nodes can be involved in a 
> single app allocation, therefore there is no need to adjust in this mode.
> To sum up, we can adjust the max queue length of app activities like this:
> {code}
> int configuredMaxQueueLength;
> int maxQueueLength;
>
> void serviceInit() {
>   ...
>   configuredMaxQueueLength = ...; // read the configured max queue length
>   maxQueueLength = configuredMaxQueueLength; // use the configured value by default
> }
>
> // CleanupThread#run
> void run() {
>   ...
>   if (multiNodeDisabled) {
>     if (asyncSchedulingEnabled) {
>       maxQueueLength = Math.max(configuredMaxQueueLength,
>           numSchedulingThreads * numNodes);
>     } else {
>       // round up so the result stays an int and never drops below 1.2 * numNodes
>       maxQueueLength = Math.max(configuredMaxQueueLength,
>           (int) Math.ceil(1.2 * numNodes));
>     }
>   } else if (maxQueueLength != configuredMaxQueueLength) {
>     maxQueueLength = configuredMaxQueueLength;
>   }
> }
> {code}
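
The adjustment rule in the description can also be written as a small self-contained function, which makes the three modes easy to check by hand. This is only a sketch under stated assumptions: the class and method names (MaxQueueLengthSketch, adjust) are hypothetical, the 1.2 factor is the example value from the description, and integer arithmetic is used for the rounding instead of floating point.

```java
class MaxQueueLengthSketch {

  /**
   * Returns the max queue length for app activities.
   * All parameter names are illustrative; they are not taken from the patch.
   */
  static int adjust(int configured, boolean multiNodeEnabled,
      boolean asyncSchedulingEnabled, int numSchedulingThreads, int numNodes) {
    if (multiNodeEnabled) {
      // One allocation can already cover all nodes, so keep the configured value.
      return configured;
    }
    long needed;
    if (asyncSchedulingEnabled) {
      // Every async thread walks all nodes in order.
      needed = (long) numSchedulingThreads * numNodes;
    } else {
      // ceil(1.2 * numNodes) computed as ceil(6 * numNodes / 5) in integers.
      needed = (6L * numNodes + 4) / 5;
    }
    // Never go below the configured value; clamp to the int range.
    return (int) Math.min(Integer.MAX_VALUE, Math.max(configured, needed));
  }

  public static void main(String[] args) {
    System.out.println(adjust(1000, false, false, 1, 5000)); // heartbeat-driven
    System.out.println(adjust(1000, false, true, 4, 500));   // async scheduling
    System.out.println(adjust(1000, true, true, 4, 500));    // multi-node placement
  }
}
```

Keeping the calculation in a pure function like this would also make it straightforward to unit test separately from the cleanup thread.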






[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-06-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875037#comment-16875037
 ] 

Weiwei Yang commented on YARN-9655:
---

Oops. The fix was simple, which made me overlook that there is no UT for this. 
Let me revert the commit for now.

[~hunhun], could you add a UT to cover this NPE?

Thanks

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
>



[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-06-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875035#comment-16875035
 ] 

Weiwei Yang commented on YARN-9655:
---

+1. committing shortly.

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
>



[jira] [Assigned] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-06-28 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YARN-9655:
-

Assignee: hunshenshi

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
>



[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-06-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874994#comment-16874994
 ] 

Weiwei Yang commented on YARN-9623:
---

Pushed to trunk, thanks for the contribution [~Tao Yang].

> Auto adjust max queue length of app activities to make sure activities on all 
> nodes can be covered
> --
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9623.001.patch, YARN-9623.002.patch
>
>



[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-06-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874991#comment-16874991
 ] 

Weiwei Yang commented on YARN-9623:
---

+1, committing shortly.

> Auto adjust max queue length of app activities to make sure activities on all 
> nodes can be covered
> --
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9623.001.patch, YARN-9623.002.patch
>
>



[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874571#comment-16874571
 ] 

Weiwei Yang commented on YARN-6629:
---

Hi [~aihuaxu], that's correct, this will be included in 2.10. But if you need 
it in the next 2.9.x release, then we need to backport to branch-2.9.

> NPE occurred when container allocation proposal is applied but its resource 
> requests are removed before
> ---
>
> Key: YARN-6629
> URL: https://issues.apache.org/jira/browse/YARN-6629
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.1.0, 2.10.0
>
> Attachments: YARN-6629.001.patch, YARN-6629.002.patch, 
> YARN-6629.003.patch, YARN-6629.004.patch, YARN-6629.005.patch, 
> YARN-6629.006.patch, YARN-6629.branch-2.001.patch
>
>
> I wrote a test case to reproduce another problem on branch-2 and found a new 
> NPE. Log: 
> {code}
> FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516)
> at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225)
> at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31)
> at org.mockito.internal.MockHandler.handle(MockHandler.java:97)
> at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply()
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Steps to reproduce, in chronological order:
> 1. The AM started and requested 1 container with schedulerRequestKey#1: 
> ApplicationMasterService#allocate --> CapacityScheduler#allocate --> 
> SchedulerApplicationAttempt#updateResourceRequests --> 
> AppSchedulingInfo#updateResourceRequests 
> (added schedulerRequestKey#1 to schedulerKeyToPlacementSets)
> 2. The scheduler allocated 1 container for this request and accepted the proposal
> 3. The AM removed this request: 
> ApplicationMasterService#allocate --> CapacityScheduler#allocate --> 
> SchedulerApplicationAttempt#updateResourceRequests --> 
> AppSchedulingInfo#updateResourceRequests --> 
> AppSchedulingInfo#addToPlacementSets --> 
> AppSchedulingInfo#updatePendingResources 
> (removed schedulerRequestKey#1 from schedulerKeyToPlacementSets)
> 4. The scheduler applied this proposal: 
> CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> 
> AppSchedulingInfo#allocate 
> An NPE is thrown when calling 
> schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, 
> type, node); because get() returns null for the key removed in step 3.
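
The four steps above boil down to a lookup on a map whose entry was removed between accepting and applying the proposal. The following sketch reproduces that ordering with plain Java collections and shows one possible guard (reject the proposal when the key is gone). It is an illustration only, not the YARN-6629 patch; AppSchedulingSketch, updateRequest and apply are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class AppSchedulingSketch {

  // Stand-in for schedulerKeyToPlacementSets: request key -> pending container count.
  private final Map<Integer, AtomicInteger> pendingByKey = new ConcurrentHashMap<>();

  /** Steps 1 and 3: the AM adds a request, or removes it by asking for zero containers. */
  void updateRequest(int requestKey, int containers) {
    if (containers <= 0) {
      pendingByKey.remove(requestKey);
    } else {
      pendingByKey.put(requestKey, new AtomicInteger(containers));
    }
  }

  /** Step 4: apply an accepted proposal; returns false instead of throwing an NPE. */
  boolean apply(int requestKey) {
    AtomicInteger pending = pendingByKey.get(requestKey);
    if (pending == null) {
      // The request was removed after the proposal was accepted, so reject it.
      return false;
    }
    pending.decrementAndGet();
    return true;
  }

  public static void main(String[] args) {
    AppSchedulingSketch app = new AppSchedulingSketch();
    app.updateRequest(1, 1); // step 1: request one container
    app.updateRequest(1, 0); // step 3: request removed before the proposal is applied
    System.out.println(app.apply(1)); // step 4: rejected rather than crashing
  }
}
```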




[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874185#comment-16874185
 ] 

Weiwei Yang commented on YARN-9655:
---

LGTM.

It would be good if someone familiar with this code could also review it.

[~hunhun], can you fix the checkstyle issue? It's simple: you just need to 
shorten the line to fewer than 80 chars.

Thanks

> AllocateResponse in FederationInterceptor lost  applicationPriority
> ---
>
> Key: YARN-9655
> URL: https://issues.apache.org/jira/browse/YARN-9655
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
>



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874181#comment-16874181
 ] 

Weiwei Yang commented on YARN-9642:
---

Sorry for getting to this late. It's a good catch, +1.

Thanks [~bibinchundatt], [~sunilg].

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9642.001.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask can keep holding a reference to the Scheduler after a fast 
> switch-over.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled in 
> serviceStop.
> This also causes a memory leak.
> !image-2019-06-22-16-05-24-114.png!
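
As a rough illustration of the fix direction named in the description, the sketch below cancels a java.util.Timer when the service stops, so the task no longer runs and no longer pins the owning object. It is not the AbstractYarnScheduler code; the class name, the timer name and the one-second period are assumptions made for the example.

```java
import java.util.Timer;
import java.util.TimerTask;

class SchedulerTimerSketch {

  private Timer cleanupTimer; // guarded by "this"

  /** Starts the periodic cleanup task (period chosen arbitrarily for the sketch). */
  synchronized void serviceStart() {
    cleanupTimer = new Timer("pending-container-cleanup", true);
    cleanupTimer.schedule(new TimerTask() {
      @Override
      public void run() {
        clearPendingContainerCache();
      }
    }, 1000L, 1000L);
  }

  /** Cancels the timer so the task stops and releases its reference to this object. */
  synchronized void serviceStop() {
    if (cleanupTimer != null) {
      cleanupTimer.cancel();
      cleanupTimer = null;
    }
  }

  synchronized boolean hasActiveTimer() {
    return cleanupTimer != null;
  }

  void clearPendingContainerCache() {
    // Placeholder for the real cleanup work.
  }

  public static void main(String[] args) {
    SchedulerTimerSketch scheduler = new SchedulerTimerSketch();
    scheduler.serviceStart();
    scheduler.serviceStop(); // e.g. on transition to standby
    System.out.println(scheduler.hasActiveTimer());
  }
}
```

Making serviceStop idempotent, as here, keeps repeated stop calls during a fast switch-over harmless.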






[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874176#comment-16874176
 ] 

Weiwei Yang commented on YARN-6629:
---

hi [~aihuaxu], [~Tao Yang]

Feel free to create another Jira for the backport, loop me in and I'll help to 
review/commit.

Thanks.

> NPE occurred when container allocation proposal is applied but its resource 
> requests are removed before
> ---
>
> Key: YARN-6629
> URL: https://issues.apache.org/jira/browse/YARN-6629
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.1.0, 2.10.0
>
> Attachments: YARN-6629.001.patch, YARN-6629.002.patch, 
> YARN-6629.003.patch, YARN-6629.004.patch, YARN-6629.005.patch, 
> YARN-6629.006.patch, YARN-6629.branch-2.001.patch
>
>

[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874168#comment-16874168
 ] 

Weiwei Yang commented on YARN-9623:
---

Hi [~Tao Yang]

OK, I am fine with that. However, we still need the configuration 
{{yarn.resourcemanager.activities-manager.app-activities.max-queue-length}} to 
be there. If this configuration is set, its value should be enforced as 
the queue size and the auto-adjustment disabled. Can you add that logic?

This is to ensure we have a workaround if the auto-calculation is suboptimal. 
Hope that makes sense.

Thanks
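
The requested behaviour (an explicitly configured value wins, otherwise auto-adjust) could look roughly like the sketch below. QueueLengthPolicy and effectiveMaxQueueLength are hypothetical names, and the 1.2 * numNodes fallback is taken from the issue description rather than from any committed patch.

```java
import java.util.OptionalInt;

/** Hypothetical helper; not part of the YARN code base. */
final class QueueLengthPolicy {
  /**
   * @param configured explicit max-queue-length, empty when the admin left it unset
   * @param numNodes   current number of nodes in the cluster
   */
  static int effectiveMaxQueueLength(OptionalInt configured, int numNodes) {
    if (configured.isPresent()) {
      // An explicit setting is enforced and disables the auto-adjustment.
      return configured.getAsInt();
    }
    // ceil(1.2 * numNodes) in integer arithmetic: ceil(6n / 5) == (6n + 4) / 5.
    return (numNodes * 6 + 4) / 5;
  }
}
```

Modelling "unset" as an empty OptionalInt is an assumption of this sketch; how the real configuration distinguishes an unset value from a default is not stated in this thread.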

> Auto adjust max queue length of app activities to make sure activities on all 
> nodes can be covered
> --
>
> Key: YARN-9623
> URL: https://issues.apache.org/jira/browse/YARN-9623
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9623.001.patch
>
>
> Currently we can use configuration entry 
> "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to 
> control the max queue length of app activities, but in some scenarios this 
> configuration may need to be updated as the cluster grows. Moreover, it is 
> better if users can ignore that conf, therefore it should be auto-adjusted 
> internally.
>  There are some differences among the scheduling modes:
>  * multi-node placement disabled
>  ** Heartbeat-driven scheduling: the max queue length of app activities should 
> not be less than the number of nodes. Since nodes cannot always be in 
> order, we should leave some room for misordering; for example, we can guarantee 
> that the max queue length is not less than 1.2 * numNodes.
>  ** Async scheduling: every async scheduling thread goes through all nodes in 
> order, so in this mode we should guarantee that the max queue length is at 
> least numThreads * numNodes.
>  * multi-node placement enabled: activities on all nodes can be involved in a 
> single app allocation, therefore there is no need to adjust for this mode.
> To sum up, we can adjust the max queue length of app activities like this:
> {code}
> int configuredMaxQueueLength;
> int maxQueueLength;
> serviceInit() {
>   ...
>   configuredMaxQueueLength = ...; // read configured max queue length
>   maxQueueLength = configuredMaxQueueLength; // take configured value as default
> }
> CleanupThread#run() {
>   ...
>   if (multiNodeDisabled) {
>     if (asyncSchedulingEnabled) {
>       maxQueueLength = Math.max(configuredMaxQueueLength,
>           numSchedulingThreads * numNodes);
>     } else {
>       // 1.2 * numNodes is a double; round up so the result fits an int
>       maxQueueLength = Math.max(configuredMaxQueueLength,
>           (int) Math.ceil(1.2 * numNodes));
>     }
>   } else if (maxQueueLength != configuredMaxQueueLength) {
>     maxQueueLength = configuredMaxQueueLength;
>   }
> }
> {code}






[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered

2019-06-27 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873966#comment-16873966
 ] 

Weiwei Yang commented on YARN-9623:
---

Hi [~Tao Yang]

Generally, I think this is a good approach, to have fewer configs.

However, the activity manager should be a general service; it should not 
depend on CS's configuration, for example the number of async threads. How 
about letting it just be {{1.2 * numOfNodes}} for both cases and seeing how this 
works? We can continue to tune this after we have more experience using this 
in real clusters.

Another thing is {{appActivitiesMaxQueueLength}}: do we need to make it atomic, 
since it is being modified in another thread?

Thanks
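
To illustrate the visibility concern, here is a minimal sketch in which one background thread updates the limit and other threads read it. ActivitiesQueueLimit and its methods are hypothetical; only the field name and the 1.2 * numOfNodes rule come from this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder for the limit; not the actual ActivitiesManager code. */
final class ActivitiesQueueLimit {
  // AtomicInteger makes writes from the background thread visible to readers.
  private final AtomicInteger appActivitiesMaxQueueLength;

  ActivitiesQueueLimit(int initial) {
    this.appActivitiesMaxQueueLength = new AtomicInteger(initial);
  }

  /** Called by the background thread whenever the node count is re-read. */
  void onNodeCountChanged(int numNodes) {
    // ceil(1.2 * numNodes) in integer arithmetic: (6n + 4) / 5.
    appActivitiesMaxQueueLength.set((numNodes * 6 + 4) / 5);
  }

  /** Called by threads that record app activities. */
  int get() {
    return appActivitiesMaxQueueLength.get();
  }
}
```

With a single writer that only overwrites the value, a volatile int would give the same visibility guarantee. An AtomicInteger is mainly useful if read-modify-write updates are added later.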

 




[jira] [Commented] (YARN-9451) AggregatedLogsBlock shows wrong NM http port

2019-06-24 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871431#comment-16871431
 ] 

Weiwei Yang commented on YARN-9451:
---

Thanks, the patch LGTM. I guess using the HTTP port makes it easier for people 
to find out which NM this is. Just one thing about the log message, which is 
quite confusing:
{noformat}
try the nodemanager at yarn-ats-3:45454
{noformat}
Shouldn't it be something like:
{noformat}
try to find the container logs in the local directory of nodemanager 
yarn-ats-3:45454
{noformat}
Can we fix that too?
Thanks

> AggregatedLogsBlock shows wrong NM http port
> 
>
> Key: YARN-9451
> URL: https://issues.apache.org/jira/browse/YARN-9451
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: Screen Shot 2019-06-20 at 7.49.46 PM.png, 
> YARN-9451-001.patch, YARN-9451-002.patch
>
>
> AggregatedLogsBlock shows wrong NM http port when aggregated file is not 
> available. It shows [http://yarn-ats-3:45454|http://yarn-ats-3:45454/] - NM 
> rpc port instead of http port.
> {code:java}
> Logs not available for job_1554476304275_0003. Aggregation may not be 
> complete, Check back later or try the nodemanager at yarn-ats-3:45454
> Or see application log at 
> http://yarn-ats-3:45454/node/application/application_1554476304275_0003
> {code}






[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-21 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869361#comment-16869361
 ] 

Weiwei Yang commented on YARN-9209:
---

Thanks [~tarunparimi]. I just committed this to trunk and cherry-picked to 
branch-3.2.

> When nodePartition is not set in Placement Constraints, containers are 
> allocated only in default partition
> --
>
> Key: YARN-9209
> URL: https://issues.apache.org/jira/browse/YARN-9209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9209.001.patch, YARN-9209.002.patch, 
> YARN-9209.003.patch
>
>
> When an application sets a placement constraint without specifying a 
> nodePartition, the default partition is always chosen as the constraint when 
> allocating containers. This can be a problem when an application is 
> submitted to a queue which doesn't have enough capacity available on the 
> default partition.
>  This is a common scenario when node labels are configured for a particular 
> queue. The below sample sleeper service cannot get even a single container 
> allocated when it is submitted to a "labeled_queue", even though enough 
> capacity is available on the label/partition configured for the queue. Only 
> the AM container runs. 
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1.0.0",
>   "queue": "labeled_queue",
>   "components": [
>     {
>       "name": "sleeper",
>       "number_of_containers": 2,
>       "launch_command": "sleep 9",
>       "resource": {
>         "cpus": 1,
>         "memory": "4096"
>       },
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "sleeper"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> It runs fine if I specify the node_partition explicitly in the constraints 
> like below. 
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1.0.0",
>   "queue": "labeled_queue",
>   "components": [
>     {
>       "name": "sleeper",
>       "number_of_containers": 2,
>       "launch_command": "sleep 9",
>       "resource": {
>         "cpus": 1,
>         "memory": "4096"
>       },
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "sleeper"
>             ],
>             "node_partitions": [
>               "label"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code} 
> The problem seems to be because only the default partition "" is considered 
> when node_partition constraint is not specified as seen in below RM log. 
> {code:java}
> 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator 
> (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367))
>  - Successfully added SchedulingRequest to 
> app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. 
> nodePartition= 
> {code} 
> However, I think it makes more sense to consider "*" or the 
> {{default-node-label-expression}} of the queue, if configured, when no 
> node_partition is specified in the placement constraint, since not specifying 
> any node_partition should ideally mean we don't enforce placement constraints 
> on any node_partition. Currently we are enforcing the default partition 
> instead.






[jira] [Updated] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-21 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9209:
--
Fix Version/s: 3.2.1




[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-21 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869342#comment-16869342
 ] 

Weiwei Yang commented on YARN-9209:
---

Agree. +1 for this patch, let's get this issue fixed first.

For the documentation, feel free to create a new issue to track.

Thanks




[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-06-21 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869275#comment-16869275
 ] 

Weiwei Yang commented on YARN-9209:
---

Hi [~tarunparimi]

Sorry, it's been a while. My understanding is that this patch does fix the 
problem: when node labels are enabled and the PC doesn't specify a partition, 
you add the queue's default partition. This is the same logic as for a request 
without a PC. Correct?

Right now we have some limitations in supporting ANY partition in PC, as 
[~leftnoteasy] previously mentioned. I think we need to document this somewhere 
to set the correct expectations.

BTW, the patch looks good to me.

Thanks.
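
The fallback described in this comment can be sketched as below. PartitionResolver and its parameters are hypothetical and only mirror the logic as summarized here (explicit partition first, then the queue's default when node labels are enabled, otherwise the default partition ""). It is not the code from the attached patches.

```java
/** Hypothetical illustration of the partition fallback; not the YARN-9209 patch. */
final class PartitionResolver {
  /** The default (unlabeled) partition is represented by the empty string. */
  static final String DEFAULT_PARTITION = "";

  /**
   * @param constraintPartition         partition named in the placement constraint, or null
   * @param queueDefaultLabelExpression the queue's default node label expression, or null
   * @param nodeLabelsEnabled           whether node labels are enabled on the cluster
   */
  static String resolve(String constraintPartition,
      String queueDefaultLabelExpression, boolean nodeLabelsEnabled) {
    if (constraintPartition != null) {
      return constraintPartition; // an explicit partition always wins
    }
    if (nodeLabelsEnabled && queueDefaultLabelExpression != null) {
      return queueDefaultLabelExpression; // fall back to the queue's default
    }
    return DEFAULT_PARTITION;
  }
}
```

For the sleeper service in the description, resolve(null, "label", true) would yield "label", so containers could be placed on the labeled partition without naming it in the constraint.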




[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed

2019-06-21 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869268#comment-16869268
 ] 

Weiwei Yang commented on YARN-9634:
---

Hi [~zhuqi]

Thanks. Are you suggesting that the parameter 
{{yarn.nodemanager.remote-app-log-dir}} should support multiple dirs, selected 
with, e.g., a round-robin policy? If so, will this break anything? I assume 
components such as the history server, the timeline server and the log CLI 
need to read logs from this location. Please elaborate, thanks.
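
To make the reader-side concern concrete, the sketch below contrasts the two policies named in the issue description. RemoteLogDirSelector, its methods and the directory names used with it are hypothetical; nothing here reflects an existing YARN API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical selector over several remote log dirs; illustrative only. */
final class RemoteLogDirSelector {
  private final List<String> dirs;
  private final AtomicLong counter = new AtomicLong();

  RemoteLogDirSelector(List<String> dirs) {
    if (dirs.isEmpty()) {
      throw new IllegalArgumentException("at least one directory is required");
    }
    this.dirs = new ArrayList<>(dirs);
  }

  /** Round robin: spreads writes evenly, but a reader cannot recompute the choice. */
  String nextRoundRobin() {
    int index = (int) (counter.getAndIncrement() % dirs.size());
    return dirs.get(index);
  }

  /** Hash: the same application ID always maps to the same directory. */
  String forApplication(String applicationId) {
    return dirs.get(Math.floorMod(applicationId.hashCode(), dirs.size()));
  }
}
```

With the hash policy a reader that knows the directory list can locate an application's logs by recomputing the index. With round robin it would have to search every directory or consult a separately stored mapping.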

> Make yarn submit dir and log aggregation dir more evenly distributed
> 
>
> Key: YARN-9634
> URL: https://issues.apache.org/jira/browse/YARN-9634
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> When the cluster is large, the directory to which users submit jobs, the 
> directory where container logs are aggregated, and other information can fill 
> up the HDFS directory, because an HDFS directory has a default storage limit. 
> To address this, we can distribute these directories more evenly, choosing 
> among them with a policy such as hash or round robin.






[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed

2019-06-20 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868547#comment-16868547
 ] 

Weiwei Yang commented on YARN-9634:
---

Hi [~zhuqi], what do you mean by the default storage limit? Is that the space 
quota? Can you give an example of how to distribute these dirs, and why that 
helps? Thanks.




[jira] [Commented] (YARN-9621) FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint on branch-3.1

2019-06-17 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865465#comment-16865465
 ] 

Weiwei Yang commented on YARN-9621:
---

Thanks [~Prabhu Joseph], [~pbacsko]. The fix LGTM, committing this shortly.

> FIX TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint 
> on branch-3.1
> ---
>
> Key: YARN-9621
> URL: https://issues.apache.org/jira/browse/YARN-9621
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Affects Versions: 3.1.2
>Reporter: Peter Bacsko
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9621-branch-3.1.001.patch, 
> YARN-9621-branch-3.1.002.patch, YARN-9621-branch-3.1.003.patch, 
> YARN-9621-branch-3.1.004.patch, YARN-9621-branch-3.1.005.patch
>
>
> Testcase 
> {{TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint}} 
> seems to constantly fail on branch 3.1. I believe it was introduced by 
> YARN-9253.
> {noformat}
> testDistributedShellWithPlacementConstraint(org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager)
>   Time elapsed: 24.636 s  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDSWithMultipleNodeManager.testDistributedShellWithPlacementConstraint(TestDSWithMultipleNodeManager.java:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}






[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-06-12 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862697#comment-16862697
 ] 

Weiwei Yang commented on YARN-9567:
---

Hi [~Tao Yang]

I will take a look at this later this week. Will share feedback then. Thank you.

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: image-2019-06-04-17-29-29-368.png, 
> image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, 
> no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests. It 
> would help users understand why they are outstanding if we could show the 
> diagnostics of this app alongside them. 
> Discussed with [~cheersyang]: we can passively load diagnostics from the cache of 
> completed app activities instead of actively triggering them, which may bring 
> uncontrollable risks.
> For example:
> (1) At first, if app activities have not been triggered, no diagnostics from 
> the cache are shown below the outstanding requests.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API

2019-06-12 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862658#comment-16862658
 ] 

Weiwei Yang commented on YARN-9578:
---

Thanks [~Tao Yang] for the updates, it looks good to me now. +1 for v6 patch, I 
am going to commit it shortly. Thank you.

> Add limit/actions/summarize options for app activities REST API
> ---
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch, YARN-9578.002.patch, 
> YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch, 
> YARN-9578.006.patch
>
>
> Currently the application activities REST API returns all completed activities 
> of the specified application in the cache. Most results may be redundant 
> in scenarios which only need the few latest results; for example, perhaps 
> only one result needs to be shown on the UI for debugging.






[jira] [Updated] (YARN-9578) Add limit/actions/summarize options for app activities REST API

2019-06-12 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9578:
--
Hadoop Flags: Incompatible change

> Add limit/actions/summarize options for app activities REST API
> ---
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch, YARN-9578.002.patch, 
> YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch, 
> YARN-9578.006.patch
>
>
> Currently all completed activities of specified application in cache will be 
> returned for application activities REST API. Most results may be redundant 
> in some scenarios which only need a few latest results, for example, perhaps 
> only one result is needed to be shown on UI for debugging.






[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API

2019-06-12 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861949#comment-16861949
 ] 

Weiwei Yang commented on YARN-9578:
---

Thanks [~Tao Yang]. I see, then it's fine, given that subList won't add much 
overhead.

For the REST API, please update to the simpler form we discussed; I can 
mark this as an incompatible change. That should be OK, and we should do it 
sooner rather than later.

Thanks

> Add limit/actions/summarize options for app activities REST API
> ---
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch, YARN-9578.002.patch, 
> YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch
>
>
> Currently all completed activities of specified application in cache will be 
> returned for application activities REST API. Most results may be redundant 
> in some scenarios which only need a few latest results, for example, perhaps 
> only one result is needed to be shown on UI for debugging.






[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-10 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860157#comment-16860157
 ] 

Weiwei Yang commented on YARN-9598:
---

Thanks for bringing this up and for the discussion. It looks like the 
discussion has diverged somewhat, so let's make sure we understand the problem 
we want to resolve here.

If I understand correctly, [~jutia] was observing that re-reservations are 
made on a single node because the policy always returns the same order. 
Actually, this is not the only issue: this policy may also create hot-spot 
nodes when multiple threads place allocations on the same ordered nodes. I 
think we need to improve the policy; one possible solution, as I commented 
previously, is to shuffle nodes per score-range. BTW, [~jutia], are you 
already using this policy in your cluster?

The issue [~Tao Yang] raised is also valid: when the cluster is busy, 
re-reservations are made by a lot of small asks on lots of nodes, which 
starves the big players. This issue should be reproducible with SLS. I took a 
quick look at the patch [~Tao Yang] uploaded, but I have concerns about 
disabling re-reservation. How can we make sure a big container request does 
not get starved in that case? Maybe one way to improve this is to swap 
reserved containers on NMs, e.g., if a container is already reserved somewhere 
else, we can swap its spot with another bigger container that has no 
reservation yet. Just a random thought.
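
To make the "shuffle nodes per score-range" idea concrete, here is a rough sketch. It is not the actual multi-node placement policy code: the `Node` class, the integer `score` (assumed here to mean "lower is preferred") and the `rangeWidth` parameter are all hypothetical names chosen for illustration. Nodes are grouped into score ranges, the ranges stay ordered, and only the nodes inside one range are shuffled, so concurrent scheduling threads are less likely to all hit the same first node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch; Node, score and rangeWidth are illustrative names,
// not the real multi-node placement policy API.
class ScoreRangeShuffleSketch {

  static final class Node {
    final String id;
    final int score; // assumption: a lower score means a more preferred node

    Node(String id, int score) {
      this.id = id;
      this.score = score;
    }
  }

  // Returns the nodes ordered by score range (lowest range first), with the
  // nodes inside each range in random order.
  static List<Node> order(List<Node> nodes, int rangeWidth, Random rnd) {
    if (rangeWidth <= 0) {
      throw new IllegalArgumentException("rangeWidth must be positive");
    }
    // TreeMap keeps the score ranges sorted from lowest to highest.
    TreeMap<Integer, List<Node>> buckets = new TreeMap<>();
    for (Node n : nodes) {
      int range = Math.floorDiv(n.score, rangeWidth);
      buckets.computeIfAbsent(range, k -> new ArrayList<>()).add(n);
    }
    List<Node> ordered = new ArrayList<>(nodes.size());
    for (List<Node> bucket : buckets.values()) {
      Collections.shuffle(bucket, rnd); // randomize only within one range
      ordered.addAll(bucket);
    }
    return ordered;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("n1", 5), new Node("n2", 7),
        new Node("n3", 12), new Node("n4", 18),
        new Node("n5", 25));
    for (Node n : order(nodes, 10, new Random())) {
      System.out.println(n.id + " score=" + n.score);
    }
  }
}
```

With a range width of 10, n1 and n2 always come before n3 and n4, and n5 is always last, but the order inside each pair varies between calls. Whether this is enough to avoid repeated re-reservations on one node would still need to be checked, e.g. with SLS.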

 

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, 
> image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems with reservation when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be 
> generated on the same node and break scheduling for this app and later 
> apps. I think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates over all nodes and tries to allocate for the 
> reserved container in LeafQueue#allocateFromReservedContainer. There are two 
> problems here:
>  ** The node of the reserved container should be taken as the candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers; 
> otherwise the scheduler may later generate a reservation-fulfilled proposal 
> on another node, which will always be rejected in 
> FiCaSchedulerApp#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers is never 
> null, even if it's just skipped. This breaks the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) { return assignment; }"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate; otherwise the scheduler may generate 
> an allocation or reservation proposal on these nodes, which will always be 
> rejected in FiCaSchedulerApp#commonCheckContainerAllocation.






[jira] [Commented] (YARN-9598) Make reservation work well when multi-node enabled

2019-06-06 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857545#comment-16857545
 ] 

Weiwei Yang commented on YARN-9598:
---

Ping [~jutia], could you help take a look too? Thanks

> Make reservation work well when multi-node enabled
> --
>
> Key: YARN-9598
> URL: https://issues.apache.org/jira/browse/YARN-9598
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9598.001.patch
>
>
> This issue is to solve problems with reservation when multi-node is enabled:
>  # As discussed in YARN-9576, a re-reservation proposal may always be 
> generated on the same node and break scheduling for this app and later 
> apps. I think re-reservation is unnecessary and we can replace it with 
> LOCALITY_SKIPPED to let the scheduler have a chance to look at the following 
> candidates for this app when multi-node is enabled.
>  # The scheduler iterates over all nodes and tries to allocate for the 
> reserved container in LeafQueue#allocateFromReservedContainer. There are two 
> problems here:
>  ** The node of the reserved container should be taken as the candidate 
> instead of all nodes when calling FiCaSchedulerApp#assignContainers; 
> otherwise the scheduler may later generate a reservation-fulfilled proposal 
> on another node, which will always be rejected in 
> FiCaSchedulerApp#commonCheckContainerAllocation.
>  ** The assignment returned by FiCaSchedulerApp#assignContainers is never 
> null, even if it's just skipped. This breaks the normal scheduling process 
> for this leaf queue because of the if clause in LeafQueue#assignContainers: 
> "if (null != assignment) { return assignment; }"
>  # Nodes which have been reserved should be skipped when iterating candidates 
> in RegularContainerAllocator#allocate; otherwise the scheduler may generate 
> an allocation or reservation proposal on these nodes, which will always be 
> rejected in FiCaSchedulerApp#commonCheckContainerAllocation.






[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API

2019-06-06 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857544#comment-16857544
 ] 

Weiwei Yang commented on YARN-9578:
---

Hi [~Tao Yang]

For the limit option, I was referring to this code snippet:

{code}
allocations = curAllocations.stream()
    .map(e -> e.filterAllocationAttempts(requestPriorities, allocationRequestIds))
    .filter(e -> !e.getAllocationAttempts().isEmpty())
    .collect(Collectors.toList());
{code}

Can we add a limit here instead of using a subList later?
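
As a rough illustration of what I mean, the sketch below applies the limit inside the stream pipeline. It uses plain lists of strings as a stand-in for the allocation objects, and the class name, method name and `limit` parameter are hypothetical, not the actual ActivitiesManager code. Since streams are evaluated lazily, `limit()` stops pulling elements once enough results have passed the filter, so the remaining entries are never processed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in: each inner list plays the role of one allocation's
// attempts; an empty inner list is an allocation with nothing left to show.
class LimitSketch {

  // Keeps at most `limit` non-empty entries, in encounter order.
  static List<List<String>> firstNonEmpty(List<List<String>> all, int limit) {
    return all.stream()
        .filter(e -> !e.isEmpty())
        .limit(limit) // short-circuits: later elements are not filtered
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<List<String>> all = Arrays.asList(
        Arrays.asList("a1"),
        Collections.<String>emptyList(),
        Arrays.asList("a2"),
        Arrays.asList("a3"));
    System.out.println(firstNonEmpty(all, 2)); // prints [[a1], [a2]]
  }
}
```

Note that `limit()` keeps the first N elements in encounter order, so if the latest results are wanted, the source collection would have to be iterated from newest to oldest.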

Thanks

> Add limit/actions/summarize options for app activities REST API
> ---
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch, YARN-9578.002.patch, 
> YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch
>
>
> Currently all completed activities of specified application in cache will be 
> returned for application activities REST API. Most results may be redundant 
> in some scenarios which only need a few latest results, for example, perhaps 
> only one result is needed to be shown on UI for debugging.






[jira] [Commented] (YARN-9590) Correct incompatible, incomplete and redundant activities

2019-06-06 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857541#comment-16857541
 ] 

Weiwei Yang commented on YARN-9590:
---

Sure, makes sense to me. LGTM, +1. I'll commit shortly.

> Correct incompatible, incomplete and redundant activities
> -
>
> Key: YARN-9590
> URL: https://issues.apache.org/jira/browse/YARN-9590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9590.001.patch, YARN-9590.002.patch, 
> YARN-9590.003.patch
>
>
> Currently some branches in scheduling process may generate incomplete or 
> duplicate activities, we should fix them to make activities clean.






[jira] [Commented] (YARN-9590) Correct incompatible, incomplete and redundant activities

2019-06-06 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857513#comment-16857513
 ] 

Weiwei Yang commented on YARN-9590:
---

Hi [~Tao Yang]

Thanks, good cleanup and fixes. Just one thing I'm curious about. Currently, 
ActivitiesManager maintains such a map of queues

ConcurrentMap> completedAppAllocations

and each queue is a {{ConcurrentLinkedQueue}}. But since what is wanted here is 
a limited-size queue, why not use, e.g., Guava's EvictingQueue?
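
For reference, the behaviour I have in mind is sketched below with JDK classes only. This is a hypothetical helper, not Guava's implementation and not existing Hadoop code: once the queue holds `maxSize` elements, adding a new one drops the oldest. The methods are synchronized because, as far as I know, Guava's EvictingQueue is documented as not thread-safe, so a drop-in replacement for a ConcurrentLinkedQueue would need some form of external synchronization anyway.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a size-bounded queue that evicts its oldest element.
class BoundedQueueSketch<E> {
  private final int maxSize;
  private final ArrayDeque<E> items = new ArrayDeque<>();

  BoundedQueueSketch(int maxSize) {
    if (maxSize <= 0) {
      throw new IllegalArgumentException("maxSize must be positive");
    }
    this.maxSize = maxSize;
  }

  // Appends to the tail; when the queue is full, the head is evicted first.
  synchronized void add(E e) {
    if (items.size() == maxSize) {
      items.pollFirst();
    }
    items.addLast(e);
  }

  // Returns a copy of the current contents, oldest element first.
  synchronized List<E> snapshot() {
    return new ArrayList<>(items);
  }

  public static void main(String[] args) {
    BoundedQueueSketch<Integer> q = new BoundedQueueSketch<>(3);
    for (int i = 1; i <= 5; i++) {
      q.add(i);
    }
    System.out.println(q.snapshot()); // prints [3, 4, 5]
  }
}
```

This removes the need for a separate size check and manual poll at every call site that appends to the queue.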

> Correct incompatible, incomplete and redundant activities
> -
>
> Key: YARN-9590
> URL: https://issues.apache.org/jira/browse/YARN-9590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9590.001.patch, YARN-9590.002.patch, 
> YARN-9590.003.patch
>
>
> Currently some branches in scheduling process may generate incomplete or 
> duplicate activities, we should fix them to make activities clean.






[jira] [Commented] (YARN-9578) Add limit/actions/summarize options for app activities REST API

2019-06-06 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857495#comment-16857495
 ] 

Weiwei Yang commented on YARN-9578:
---

Hi [~Tao Yang]

Thanks for the patch. A few comments here

1. ActivitiesManager#getAppActivitiesInfo

Since you are supporting a limit on the number of records here, can the limit 
be applied while generating the allocations instead of cutting off from a full 
list?

Is there a default value for this limit?

2. RMWebServices#getAppActivities

I am a bit confused by the query parameter ACTIONS; it looks like it is always 
required, but this form is not a very RESTful pattern. Instead, I would expect 
the query URLs to look like
{code:java}
/scheduler/app-activities/app-id/get?xxx
/scheduler/app-activities/app-id/update?xxx
{code}
3. RMConstants#AppActivitiesRequiredAction

Can we rename "UPDATE" to "REFRESH"?

Thanks

> Add limit/actions/summarize options for app activities REST API
> ---
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch, YARN-9578.002.patch, 
> YARN-9578.003.patch, YARN-9578.004.patch, YARN-9578.005.patch
>
>
> Currently all completed activities of specified application in cache will be 
> returned for application activities REST API. Most results may be redundant 
> in some scenarios which only need a few latest results, for example, perhaps 
> only one result is needed to be shown on UI for debugging.






[jira] [Commented] (YARN-9600) Support self-adaption width for columns of containers table on app attempt page

2019-06-04 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856371#comment-16856371
 ] 

Weiwei Yang commented on YARN-9600:
---

Thanks [~akhilpb] for the additional review, I'll help to commit this shortly. 

> Support self-adaption width for columns of containers table on app attempt 
> page
> ---
>
> Key: YARN-9600
> URL: https://issues.apache.org/jira/browse/YARN-9600
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9600.001.patch, image-2019-06-04-16-45-49-359.png, 
> image-2019-06-04-16-55-18-899.png
>
>
> When there are outstanding requests showing on app attempt page, the page 
> will be automatically stretched horizontally, after that, columns of 
> containers table can't fill the table and left two blank spaces in the 
> leftmost and the rightmost of this table, as the following picture shows:
> !image-2019-06-04-16-45-49-359.png|width=647,height=231!
> We can add relative width style (width:100%) for containers table to make 
> columns self-adaption.
> After doing that containers table show as follows:
> !image-2019-06-04-16-55-18-899.png|width=645,height=229!






[jira] [Commented] (YARN-9600) Support self-adaption width for columns of containers table on app attempt page

2019-06-04 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855649#comment-16855649
 ] 

Weiwei Yang commented on YARN-9600:
---

Ping [~akhilpb], would you please help to review this patch? Thx

> Support self-adaption width for columns of containers table on app attempt 
> page
> ---
>
> Key: YARN-9600
> URL: https://issues.apache.org/jira/browse/YARN-9600
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Minor
> Attachments: YARN-9600.001.patch, image-2019-06-04-16-45-49-359.png, 
> image-2019-06-04-16-55-18-899.png
>
>
> When there are outstanding requests showing on app attempt page, the page 
> will be automatically stretched horizontally, after that, columns of 
> containers table can't fill the table and left two blank spaces in the 
> leftmost and the rightmost of this table, as the following picture shows:
> !image-2019-06-04-16-45-49-359.png|width=647,height=231!
> We can add relative width style (width:100%) for containers table to make 
> columns self-adaption.
> After doing that containers table show as follows:
> !image-2019-06-04-16-55-18-899.png|width=645,height=229!






[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers

2019-06-03 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855277#comment-16855277
 ] 

Weiwei Yang commented on YARN-9580:
---

Hi [~Tao Yang]

There seem to be issues in the patch for branch-3.2, could you please take a 
look?

> Fulfilled reservation information in assignment is lost when transferring in 
> ParentQueue#assignContainers
> -
>
> Key: YARN-9580
> URL: https://issues.apache.org/jira/browse/YARN-9580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9580.001.patch, YARN-9580.branch-3.2.001.patch
>
>
> When transferring assignment from child queue to parent queue, fulfilled 
> reservation information including fulfilledReservation and 
> fulfilledReservedContainer in assignment is lost.
> When multi-node is enabled, this loss can cause a problem: an allocation 
> proposal is generated but can't be accepted because there is a check for 
> fulfilled reservation information in 
> FiCaSchedulerApp#commonCheckContainerAllocation. This loop repeats 
> endlessly and the resources of the node can no longer be used.
> In HB-driven scheduling mode, a fulfilled reservation can be allocated via 
> another call stack: CapacityScheduler#allocateContainersToNode --> 
> CapacityScheduler#allocateContainerOnSingleNode --> 
> CapacityScheduler#allocateFromReservedContainer. In this way the assignment 
> is generated by the leaf queue and submitted directly, which I think is why 
> we have hardly seen this problem before.






[jira] [Reopened] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers

2019-06-03 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YARN-9580:
---

> Fulfilled reservation information in assignment is lost when transferring in 
> ParentQueue#assignContainers
> -
>
> Key: YARN-9580
> URL: https://issues.apache.org/jira/browse/YARN-9580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9580.001.patch, YARN-9580.branch-3.2.001.patch
>
>
> When transferring assignment from child queue to parent queue, fulfilled 
> reservation information including fulfilledReservation and 
> fulfilledReservedContainer in assignment is lost.
> When multi-node is enabled, this loss can cause a problem: an allocation 
> proposal is generated but can't be accepted because there is a check for 
> fulfilled reservation information in 
> FiCaSchedulerApp#commonCheckContainerAllocation. This loop repeats 
> endlessly and the resources of the node can no longer be used.
> In HB-driven scheduling mode, a fulfilled reservation can be allocated via 
> another call stack: CapacityScheduler#allocateContainersToNode --> 
> CapacityScheduler#allocateContainerOnSingleNode --> 
> CapacityScheduler#allocateFromReservedContainer. In this way the assignment 
> is generated by the leaf queue and submitted directly, which I think is why 
> we have hardly seen this problem before.






[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers

2019-06-03 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854701#comment-16854701
 ] 

Weiwei Yang commented on YARN-9580:
---

Committed to trunk. [~Tao Yang], can you please provide a patch for branch-3.2 
too?

> Fulfilled reservation information in assignment is lost when transferring in 
> ParentQueue#assignContainers
> -
>
> Key: YARN-9580
> URL: https://issues.apache.org/jira/browse/YARN-9580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9580.001.patch
>
>
> When transferring assignment from child queue to parent queue, fulfilled 
> reservation information including fulfilledReservation and 
> fulfilledReservedContainer in assignment is lost.
> When multi-node is enabled, this loss can cause a problem: an allocation 
> proposal is generated but can't be accepted because there is a check for 
> fulfilled reservation information in 
> FiCaSchedulerApp#commonCheckContainerAllocation. This loop repeats 
> endlessly and the resources of the node can no longer be used.
> In HB-driven scheduling mode, a fulfilled reservation can be allocated via 
> another call stack: CapacityScheduler#allocateContainersToNode --> 
> CapacityScheduler#allocateContainerOnSingleNode --> 
> CapacityScheduler#allocateFromReservedContainer. In this way the assignment 
> is generated by the leaf queue and submitted directly, which I think is why 
> we have hardly seen this problem before.






[jira] [Commented] (YARN-9580) Fulfilled reservation information in assignment is lost when transferring in ParentQueue#assignContainers

2019-06-03 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854697#comment-16854697
 ] 

Weiwei Yang commented on YARN-9580:
---

[~Tao Yang], thanks for the patch, it makes sense to me. +1.

> Fulfilled reservation information in assignment is lost when transferring in 
> ParentQueue#assignContainers
> -
>
> Key: YARN-9580
> URL: https://issues.apache.org/jira/browse/YARN-9580
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9580.001.patch
>
>
> When transferring assignment from child queue to parent queue, fulfilled 
> reservation information including fulfilledReservation and 
> fulfilledReservedContainer in assignment is lost.
> When multi-node is enabled, this loss can cause a problem: an allocation 
> proposal is generated but can't be accepted because there is a check for 
> fulfilled reservation information in 
> FiCaSchedulerApp#commonCheckContainerAllocation. This loop repeats 
> endlessly and the resources of the node can no longer be used.
> In HB-driven scheduling mode, a fulfilled reservation can be allocated via 
> another call stack: CapacityScheduler#allocateContainersToNode --> 
> CapacityScheduler#allocateContainerOnSingleNode --> 
> CapacityScheduler#allocateFromReservedContainer. In this way the assignment 
> is generated by the leaf queue and submitted directly, which I think is why 
> we have hardly seen this problem before.






[jira] [Commented] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure

2019-06-03 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854265#comment-16854265
 ] 

Weiwei Yang commented on YARN-9507:
---

Committed to trunk, cherry-picked to branch-3.2 and branch-3.1.

> Fix NPE in NodeManager#serviceStop on startup failure
> -
>
> Key: YARN-9507
> URL: https://issues.apache.org/jira/browse/YARN-9507
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9507-001.patch
>
>
> 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When 
> stopping the service NodeManager
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018)






[jira] [Updated] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure

2019-06-03 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9507:
--
Fix Version/s: 3.1.3

> Fix NPE in NodeManager#serviceStop on startup failure
> -
>
> Key: YARN-9507
> URL: https://issues.apache.org/jira/browse/YARN-9507
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9507-001.patch
>
>
> 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When 
> stopping the service NodeManager
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018)






[jira] [Updated] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure

2019-06-03 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9507:
--
Fix Version/s: 3.2.1

> Fix NPE in NodeManager#serviceStop on startup failure
> -
>
> Key: YARN-9507
> URL: https://issues.apache.org/jira/browse/YARN-9507
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9507-001.patch
>
>
> 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When 
> stopping the service NodeManager
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018)






[jira] [Commented] (YARN-9507) Fix NPE in NodeManager#serviceStop on startup failure

2019-06-03 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854246#comment-16854246
 ] 

Weiwei Yang commented on YARN-9507:
---

LGTM, +1. Committing shortly.

> Fix NPE in NodeManager#serviceStop on startup failure
> -
>
> Key: YARN-9507
> URL: https://issues.apache.org/jira/browse/YARN-9507
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-9507-001.patch
>
>
> 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When 
> stopping the service NodeManager
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018)






[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-05-30 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852032#comment-16852032
 ] 

Weiwei Yang commented on YARN-9538:
---

Hi [~Tao Yang]

Thanks for the updates in doc template, it looks much better now. I've made 
some modifications based on your v2 version. Could you please take a look? 
Thanks

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.






[jira] [Comment Edited] (YARN-8693) Add signalToContainer REST API for RMWebServices

2019-05-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850619#comment-16850619
 ] 

Weiwei Yang edited comment on YARN-8693 at 5/29/19 8:34 AM:


Hi [~Tao Yang]

The latest patch looks good to me, and the warnings in the latest Jenkins report seem to be 
irrelevant. I'll commit this shortly.
I assume you have generated the doc and verified the format locally. If you 
haven't done that, please make sure it is done properly.

Thanks.


was (Author: cheersyang):
Hi [~Tao Yang]

The latest patch looks good to me, and the warnings in the latest Jenkins report seem to be 
irrelevant. I'll commit this shortly.
Thanks.

> Add signalToContainer REST API for RMWebServices
> 
>
> Key: YARN-8693
> URL: https://issues.apache.org/jira/browse/YARN-8693
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8693.001.patch, YARN-8693.002.patch, 
> YARN-8693.003.patch
>
>
> Currently YARN has an RPC command, "yarn container -signal <container ID [signal command]>", to signal 
> OUTPUT_THREAD_DUMP/GRACEFUL_SHUTDOWN/FORCEFUL_SHUTDOWN commands to container. 
> That is not enough and we need to add signalToContainer REST API for better 
> management from cluster administrators or management system.






[jira] [Commented] (YARN-8693) Add signalToContainer REST API for RMWebServices

2019-05-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850619#comment-16850619
 ] 

Weiwei Yang commented on YARN-8693:
---

Hi [~Tao Yang]

The latest patch looks good to me, and the warnings in the latest Jenkins report seem to be 
irrelevant. I'll commit this shortly.
Thanks.

> Add signalToContainer REST API for RMWebServices
> 
>
> Key: YARN-8693
> URL: https://issues.apache.org/jira/browse/YARN-8693
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8693.001.patch, YARN-8693.002.patch, 
> YARN-8693.003.patch
>
>
> Currently YARN has an RPC command, "yarn container -signal <container ID [signal command]>", to signal 
> OUTPUT_THREAD_DUMP/GRACEFUL_SHUTDOWN/FORCEFUL_SHUTDOWN commands to container. 
> That is not enough and we need to add signalToContainer REST API for better 
> management from cluster administrators or management system.






[jira] [Commented] (YARN-9578) Add limit option to control number of results for app activities REST API

2019-05-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849370#comment-16849370
 ] 

Weiwei Yang commented on YARN-9578:
---

Hi [~Tao Yang], could you please submit the patch now that YARN-9497 is resolved?

> Add limit option to control number of results for app activities REST API
> -
>
> Key: YARN-9578
> URL: https://issues.apache.org/jira/browse/YARN-9578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9578.001.patch
>
>
> Currently all completed activities of specified application in cache will be 
> returned for application activities REST API. Most results may be redundant 
> in some scenarios which only need a few latest results, for example, perhaps 
> only one result is needed to be shown on UI for debugging.
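
The effect of such a limit option can be sketched as below: given cached results ordered oldest first, return only the most recent ones. This is a hedged illustration only; the class name, the method name, and the newest-first output order are assumptions, not the REST API's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of applying a "limit" to cached activity results. */
final class LatestResults {

  /**
   * Returns at most {@code limit} of the most recent entries, newest first.
   * The input is assumed to be ordered oldest first and is not modified.
   */
  static <T> List<T> latest(List<T> oldestFirst, int limit) {
    if (limit < 0) {
      throw new IllegalArgumentException("limit must be >= 0: " + limit);
    }
    int from = Math.max(0, oldestFirst.size() - limit);
    // Copy the tail so reversing does not touch the caller's list.
    List<T> result = new ArrayList<>(oldestFirst.subList(from, oldestFirst.size()));
    Collections.reverse(result);
    return result;
  }
}
```

For the debugging case mentioned in the description, a limit of 1 would return just the single latest result.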






[jira] [Commented] (YARN-8693) Add signalToContainer REST API for RMWebServices

2019-05-28 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849369#comment-16849369
 ] 

Weiwei Yang commented on YARN-8693:
---

Hi [~Tao Yang]

Thanks for adding this. I have 2 comments regarding the patch.

1. {{RMWebServices#signalToContainer}} throws a server-side exception 
when something goes wrong while handling the request. Can we send a proper 
response to the client too? E.g.: 
{code:java}
return Response.status(Status.INTERNAL_SERVER_ERROR).entity(e).build();
{code}
2. Could you please update the doc too?

Thanks

> Add signalToContainer REST API for RMWebServices
> 
>
> Key: YARN-8693
> URL: https://issues.apache.org/jira/browse/YARN-8693
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: restapi
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8693.001.patch, YARN-8693.002.patch
>
>
> Currently YARN has an RPC command, "yarn container -signal <container ID [signal command]>", to signal 
> OUTPUT_THREAD_DUMP/GRACEFUL_SHUTDOWN/FORCEFUL_SHUTDOWN commands to container. 
> That is not enough and we need to add signalToContainer REST API for better 
> management from cluster administrators or management system.






[jira] [Commented] (YARN-9497) Support grouping by diagnostics for query results of scheduler and app activities

2019-05-26 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848446#comment-16848446
 ] 

Weiwei Yang commented on YARN-9497:
---

Thanks [~Tao Yang], latest patch looks good to me. Committing now.

> Support grouping by diagnostics for query results of scheduler and app 
> activities
> -
>
> Key: YARN-9497
> URL: https://issues.apache.org/jira/browse/YARN-9497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9497.001.patch, YARN-9497.002.patch, 
> YARN-9497.003.patch
>
>
> [Design Doc 
> #4.3|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.6fbpge17dmmr]






[jira] [Commented] (YARN-7494) Add muti-node lookup mechanism and pluggable nodes sorting policies to optimize placement decision

2019-05-26 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848445#comment-16848445
 ] 

Weiwei Yang commented on YARN-7494:
---

Thanks [~jutia] for the comments. I agree the current 
ResourceUsageMultiNodeLookupPolicy might be too naive. I had a similar discussion 
before with [~sunilg] and [~Tao Yang]. We should partition the sorted nodes into 
score ranges and, within each range, shuffle the nodes so that any node in 
that range has a chance of being picked up by the allocation thread. 
We need to keep improving these policies.
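
To make the idea concrete, here is a minimal sketch of ordering nodes by score range and shuffling inside each range. It is an illustration under assumptions, not the actual policy code: the Node class, the fixed bucket width, and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch; not the actual YARN multi-node sorting policy. */
final class ScoreBucketShuffle {

  /** Minimal stand-in for a scheduler node: an id plus a sorting score. */
  static final class Node {
    final String id;
    final double score;

    Node(String id, double score) {
      this.id = id;
      this.score = score;
    }
  }

  /**
   * Orders nodes by ascending score range, where the range of a node is
   * floor(score / bucketWidth), and shuffles the nodes inside each range.
   */
  static List<Node> order(List<Node> nodes, double bucketWidth, Random random) {
    if (bucketWidth <= 0) {
      throw new IllegalArgumentException("bucketWidth must be positive");
    }
    List<Node> sorted = new ArrayList<>(nodes);
    sorted.sort(Comparator.comparingDouble((Node n) -> n.score));

    List<Node> result = new ArrayList<>(sorted.size());
    int start = 0;
    while (start < sorted.size()) {
      long bucket = bucketOf(sorted.get(start), bucketWidth);
      int end = start;
      // Advance to the first node that falls into a different range.
      while (end < sorted.size() && bucketOf(sorted.get(end), bucketWidth) == bucket) {
        end++;
      }
      List<Node> sameRange = new ArrayList<>(sorted.subList(start, end));
      Collections.shuffle(sameRange, random);
      result.addAll(sameRange);
      start = end;
    }
    return result;
  }

  private static long bucketOf(Node node, double bucketWidth) {
    return (long) Math.floor(node.score / bucketWidth);
  }
}
```

Nodes in a lower range still come before nodes in a higher one; only the order inside a range is randomized, which keeps the allocation thread from always hitting the same top node.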

> Add muti-node lookup mechanism and pluggable nodes sorting policies to 
> optimize placement decision
> --
>
> Key: YARN-7494
> URL: https://issues.apache.org/jira/browse/YARN-7494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7494.001.patch, YARN-7494.002.patch, 
> YARN-7494.003.patch, YARN-7494.004.patch, YARN-7494.005.patch, 
> YARN-7494.006.patch, YARN-7494.007.patch, YARN-7494.008.patch, 
> YARN-7494.009.patch, YARN-7494.010.patch, YARN-7494.11.patch, 
> YARN-7494.12.patch, YARN-7494.13.patch, YARN-7494.14.patch, 
> YARN-7494.15.patch, YARN-7494.16.patch, YARN-7494.17.patch, 
> YARN-7494.18.patch, YARN-7494.19.patch, YARN-7494.20.patch, 
> YARN-7494.v0.patch, YARN-7494.v1.patch, multi-node-designProposal.png
>
>
> Instead of single node, for effectiveness we can consider a multi node lookup 
> based on partition to start with.






[jira] [Commented] (YARN-9497) Support grouping by diagnostics for query results of scheduler and app activities

2019-05-20 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843929#comment-16843929
 ] 

Weiwei Yang commented on YARN-9497:
---

Hi [~Tao Yang]
{quote}perhaps we can take state dimension as default internally so that we can 
direct use "groupBy=diagnostic" and get results grouped by both state and 
diagnostic dimensions
{quote}
That sounds good to me.

Thanks
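
For illustration, the behaviour agreed on above (the state dimension applied internally, so that "groupBy=diagnostic" groups by both state and diagnostic) could look roughly like the sketch below. The Activity class, the method name, and the example values are hypothetical; this is not the code from the patch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of grouping activities for "groupBy=diagnostic". */
final class ActivityGrouping {

  /** Minimal stand-in for one recorded activity. */
  static final class Activity {
    final String state;
    final String diagnostic;

    Activity(String state, String diagnostic) {
      this.state = state;
      this.diagnostic = diagnostic;
    }
  }

  /**
   * Counts activities per (state, diagnostic) pair. Only the lower-case
   * value "diagnostic" is accepted; state is always part of the key.
   */
  static Map<List<String>, Integer> group(String groupBy, List<Activity> activities) {
    if (!"diagnostic".equals(groupBy)) {
      throw new IllegalArgumentException("Unsupported groupBy: " + groupBy);
    }
    Map<List<String>, Integer> counts = new LinkedHashMap<>();
    for (Activity activity : activities) {
      // The key combines both dimensions, so equal diagnostics in
      // different states remain separate groups.
      List<String> key = Arrays.asList(activity.state, activity.diagnostic);
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }
}
```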

> Support grouping by diagnostics for query results of scheduler and app 
> activities
> -
>
> Key: YARN-9497
> URL: https://issues.apache.org/jira/browse/YARN-9497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9497.001.patch, YARN-9497.002.patch
>
>
> [Design Doc 
> #4.3|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.6fbpge17dmmr]






[jira] [Commented] (YARN-9497) Support grouping by diagnostics for query results of scheduler and app activities

2019-05-17 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842358#comment-16842358
 ] 

Weiwei Yang commented on YARN-9497:
---

Nope, I was not asking to ignore the state dimension, just want to have a 
better/shorter name for it. Probably use them separately, something like: 
"groupBy=state=diagnostic"? Or "groupBy=state,diagnostic".

> Support grouping by diagnostics for query results of scheduler and app 
> activities
> -
>
> Key: YARN-9497
> URL: https://issues.apache.org/jira/browse/YARN-9497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9497.001.patch, YARN-9497.002.patch
>
>
> [Design Doc 
> #4.3|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.6fbpge17dmmr]






[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs

2019-05-14 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839830#comment-16839830
 ] 

Weiwei Yang commented on YARN-9538:
---

Hi [~Tao Yang]

Thanks for working on the documentation. I did some proofreading, and I think we 
should start by simplifying the content. I have shared a Google doc, which may 
be easier to work with: 
[https://docs.google.com/document/d/1NIIDCWOLUqlhrclzr91YPYOrvmLfjlKbyCWrnCMci0g].
Once the doc is finalized there, you can convert it to md.

Thanks.

> Document scheduler/app activities and REST APIs
> ---
>
> Key: YARN-9538
> URL: https://issues.apache.org/jira/browse/YARN-9538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9538.001.patch
>
>
> Add documentation for scheduler/app activities in CapacityScheduler.md and 
> ResourceManagerRest.md.






[jira] [Commented] (YARN-9497) Support grouping by diagnostics for query results of scheduler and app activities

2019-05-14 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839812#comment-16839812
 ] 

Weiwei Yang commented on YARN-9497:
---

Thanks [~Tao Yang] for the updates, it looks good now. I have one minor comment 
about the query parameter: right now it is {{groupingType}}; can we change it 
to {{groupBy}}? Regarding the type, I think we can rename 
{{STATE_AND_DIAGNOSTIC}} to {{diagnostic}}, because I don't expect different 
activity states to have the same diagnostic messages. Please correct me if I 
am wrong.

Another thing: can we make sure we use "diagnostic" rather than all capital 
letters in the query parameter? That is the convention used in the 
ResourceManager REST API.

Thanks

> Support grouping by diagnostics for query results of scheduler and app 
> activities
> -
>
> Key: YARN-9497
> URL: https://issues.apache.org/jira/browse/YARN-9497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9497.001.patch, YARN-9497.002.patch
>
>
> [Design Doc 
> #4.3|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.6fbpge17dmmr]






[jira] [Commented] (YARN-9497) Support grouping by diagnostics for query results of scheduler and app activities

2019-05-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838296#comment-16838296
 ] 

Weiwei Yang commented on YARN-9497:
---

Hi [~Tao Yang]

The patch no longer applies to trunk. Could you please rebase it?

Thanks

> Support grouping by diagnostics for query results of scheduler and app 
> activities
> -
>
> Key: YARN-9497
> URL: https://issues.apache.org/jira/browse/YARN-9497
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9497.001.patch
>
>
> [Design Doc 
> #4.3|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.6fbpge17dmmr]






[jira] [Commented] (YARN-9539) Improve cleanup process of app activities and make some conditions configurable

2019-05-12 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838275#comment-16838275
 ] 

Weiwei Yang commented on YARN-9539:
---

LGTM, +1, committing shortly

> Improve cleanup process of app activities and make some conditions 
> configurable
> ---
>
> Key: YARN-9539
> URL: https://issues.apache.org/jira/browse/YARN-9539
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9539.001.patch
>
>
> [YARN-9050 Design doc 
> #4.4|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.crdawajmm3a4]






[jira] [Commented] (YARN-9489) Support filtering by request-priorities and allocation-request-ids for query results of app activities

2019-05-09 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836401#comment-16836401
 ] 

Weiwei Yang commented on YARN-9489:
---

Thanks for confirming it, +1. Committing shortly.

> Support filtering by request-priorities and allocation-request-ids for query 
> results of app activities
> --
>
> Key: YARN-9489
> URL: https://issues.apache.org/jira/browse/YARN-9489
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9489.001.patch, YARN-9489.002.patch
>
>
> [Design Doc 
> #4.2|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.m04tqsosk94h]






[jira] [Commented] (YARN-9489) Support filtering by request-priorities and allocation-request-ids for query results of app activities

2019-05-09 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836208#comment-16836208
 ] 

Weiwei Yang commented on YARN-9489:
---

Hi [~Tao Yang]

Most changes look good to me. One thing in {{RMWebServiceProtocol}}: the 
change to {{#getAppActivities()}} adds two new query parameters, which 
means we are not breaking the old APIs, correct? Just want to confirm.

Also, please create another Jira to add some documentation for the 
{{app-activities}} REST API to the RM REST doc. It is currently missing.

Thanks

> Support filtering by request-priorities and allocation-request-ids for query 
> results of app activities
> --
>
> Key: YARN-9489
> URL: https://issues.apache.org/jira/browse/YARN-9489
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9489.001.patch, YARN-9489.002.patch
>
>
> [Design Doc 
> #4.2|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.m04tqsosk94h]






[jira] [Commented] (YARN-9432) Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled

2019-05-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835243#comment-16835243
 ] 

Weiwei Yang commented on YARN-9432:
---

Pushed to trunk, thanks [~Tao Yang] for the contribution.

> Reserved containers leak after its request has been cancelled or satisfied 
> when multi-nodes enabled
> ---
>
> Key: YARN-9432
> URL: https://issues.apache.org/jira/browse/YARN-9432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9432.001.patch, YARN-9432.002.patch, 
> YARN-9432.003.patch, YARN-9432.004.patch
>
>
> Reserved containers may change to be excess after its request has been 
> cancelled or satisfied, excess reserved containers need to be unreserved 
> quickly to release resource for others.
> For multi-nodes disabled scenario, excess reserved containers can be quickly 
> released in next node heartbeat, the calling stack is 
> CapacityScheduler#nodeUpdate -->  CapacityScheduler#allocateContainersToNode 
> --> CapacityScheduler#allocateContainerOnSingleNode. 
> But for multi-nodes enabled scenario, excess reserved containers have chance 
> to be released only in allocation process, key phase of the calling stack is 
> LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. 
> According to this, excess reserved containers may not be released until its 
> queue has pending request and has chance to be allocated, and the worst is 
> that excess reserved containers will never be released and keep holding 
> resource if there is no additional pending request for this queue.
> To solve this problem, my opinion is to directly kill excess reserved 
> containers when request is satisfied (in FiCaSchedulerApp#apply) or the 
> allocation number of resource-requests/scheduling-requests is updated to be 0 
> (in SchedulerApplicationAttempt#updateResourceRequests / 
> SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.






[jira] [Commented] (YARN-9432) Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled

2019-05-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834868#comment-16834868
 ] 

Weiwei Yang commented on YARN-9432:
---

Thanks [~Tao Yang], it looks good now. +1.

I will commit this if there are no other comments by tomorrow. Thank you.

> Reserved containers leak after its request has been cancelled or satisfied 
> when multi-nodes enabled
> ---
>
> Key: YARN-9432
> URL: https://issues.apache.org/jira/browse/YARN-9432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9432.001.patch, YARN-9432.002.patch, 
> YARN-9432.003.patch, YARN-9432.004.patch
>
>
> Reserved containers may change to be excess after its request has been 
> cancelled or satisfied, excess reserved containers need to be unreserved 
> quickly to release resource for others.
> For multi-nodes disabled scenario, excess reserved containers can be quickly 
> released in next node heartbeat, the calling stack is 
> CapacityScheduler#nodeUpdate -->  CapacityScheduler#allocateContainersToNode 
> --> CapacityScheduler#allocateContainerOnSingleNode. 
> But for multi-nodes enabled scenario, excess reserved containers have chance 
> to be released only in allocation process, key phase of the calling stack is 
> LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. 
> According to this, excess reserved containers may not be released until its 
> queue has pending request and has chance to be allocated, and the worst is 
> that excess reserved containers will never be released and keep holding 
> resource if there is no additional pending request for this queue.
> To solve this problem, my opinion is to directly kill excess reserved 
> containers when request is satisfied (in FiCaSchedulerApp#apply) or the 
> allocation number of resource-requests/scheduling-requests is updated to be 0 
> (in SchedulerApplicationAttempt#updateResourceRequests / 
> SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.






[jira] [Commented] (YARN-9432) Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled

2019-05-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834411#comment-16834411
 ] 

Weiwei Yang commented on YARN-9432:
---

Hi [~Tao Yang]

Thanks for the patch, the overall logic makes sense to me. Just one thing: 
{code:java}
List<FiCaSchedulerNode> reservedNodes =
    candidates.getAllNodes().values().stream()
        .filter(node -> node.getReservedContainer() != null)
        .collect(Collectors.toList());
for (FiCaSchedulerNode reservedNode : reservedNodes) {
  RMContainer reservedContainer = reservedNode.getReservedContainer();
  if (reservedContainer != null) {
    allocateFromReservedContainer(reservedNode, false, reservedContainer);
  }
}
{code}
This loops over the nodes twice. Can we use a single loop instead?
{code:java}
for (FiCaSchedulerNode node : candidates.getAllNodes().values()) {
  if (node.getReservedContainer() != null) {
    // allocate from the reserved container here
  }
}
{code}
Thanks

> Reserved containers leak after its request has been cancelled or satisfied 
> when multi-nodes enabled
> ---
>
> Key: YARN-9432
> URL: https://issues.apache.org/jira/browse/YARN-9432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9432.001.patch, YARN-9432.002.patch, 
> YARN-9432.003.patch
>
>
> Reserved containers may change to be excess after its request has been 
> cancelled or satisfied, excess reserved containers need to be unreserved 
> quickly to release resource for others.
> For multi-nodes disabled scenario, excess reserved containers can be quickly 
> released in next node heartbeat, the calling stack is 
> CapacityScheduler#nodeUpdate -->  CapacityScheduler#allocateContainersToNode 
> --> CapacityScheduler#allocateContainerOnSingleNode. 
> But for multi-nodes enabled scenario, excess reserved containers have chance 
> to be released only in allocation process, key phase of the calling stack is 
> LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. 
> According to this, excess reserved containers may not be released until its 
> queue has pending request and has chance to be allocated, and the worst is 
> that excess reserved containers will never be released and keep holding 
> resource if there is no additional pending request for this queue.
> To solve this problem, my opinion is to directly kill excess reserved 
> containers when request is satisfied (in FiCaSchedulerApp#apply) or the 
> allocation number of resource-requests/scheduling-requests is updated to be 0 
> (in SchedulerApplicationAttempt#updateResourceRequests / 
> SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.





