[jira] [Updated] (YARN-10608) Extend yarn.nodemanager.delete.debug-delay-sec to support application level.

2021-02-02 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10608:
--
Description: 
Currently, yarn.nodemanager.delete.debug-delay-sec is a cluster-level setting.

In our busy production cluster, we set it to 0 by default to prevent local logs 
from piling up.

But when we need to dig into errors in Spark/MR and other jobs, such as core 
dumps, I suggest supporting an application-level setting that delays deletion 
of the local logs so the error can be reproduced.

 

[~wangda] [~tangzhankun]  [~xgong] [~epayne]

Do you have any advice about this proposal?
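A minimal sketch of the proposed lookup order, assuming a hypothetical app-level property key; neither the resolver class nor the key below exists in YARN today:

```java
import java.util.Map;

public class DeletionDelayResolver {
    // Cluster-wide default, mirroring yarn.nodemanager.delete.debug-delay-sec.
    private final long clusterDelaySec;

    public DeletionDelayResolver(long clusterDelaySec) {
        this.clusterDelaySec = clusterDelaySec;
    }

    /**
     * Resolves the deletion delay for one application: an app-level override
     * (hypothetically carried in the application's submission context) wins
     * over the cluster-wide value.
     */
    public long resolve(Map<String, String> appProperties) {
        String override = appProperties.get("delete.debug-delay-sec"); // assumed key
        return override != null ? Long.parseLong(override) : clusterDelaySec;
    }
}
```

With this shape, a busy cluster can keep the cluster default at 0 while a single job being debugged opts in to a longer delay.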

  was:
Now the yarn.nodemanager.delete.debug-delay-sec is a cluster level setting.

In our production cluster, we set it to 0 default for prevent log boom.

But when we need deep into some spark/MR etc jobs errors, i advice to support 
enable a job level setting for delay of deletion for local logs for reproduce 
the error.

  


> Extend yarn.nodemanager.delete.debug-delay-sec to support application level.
> 
>
> Key: YARN-10608
> URL: https://issues.apache.org/jira/browse/YARN-10608
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> Currently, yarn.nodemanager.delete.debug-delay-sec is a cluster-level setting.
> In our busy production cluster, we set it to 0 by default to prevent local 
> logs from piling up.
> But when we need to dig into errors in Spark/MR and other jobs, such as core 
> dumps, I suggest supporting an application-level setting that delays deletion 
> of the local logs so the error can be reproduced.
>  
> [~wangda] [~tangzhankun]  [~xgong] [~epayne]
> Do you have any advice about this proposal?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10589) Improve logic of multi-node allocation

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276992#comment-17276992
 ] 

Qi Zhu commented on YARN-10589:
---

[~ztang] [~tanu.ajmera]

I attached a new patch, 003, based on 002. I think returning PARTITION_SKIPPED 
here will return from the whole node loop and move on to the next priority 
(PRIORITY_SKIPPED):
{code:java}
while (iter.hasNext()) {
  FiCaSchedulerNode node = iter.next();

  if (reservedContainer == null) {
result = preCheckForNodeCandidateSet(clusterResource, node,
schedulingMode, resourceLimits, schedulerKey);
if (null != result) {
  if (result == ContainerAllocation.PARTITION_SKIPPED) {
return result;
  } else {
continue;
  }
}
  }
{code}
Thanks.
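The short-circuit described above can be sketched with simplified stand-ins; the enum values loosely mirror the ContainerAllocation names, but everything else here is illustrative, not the actual scheduler code:

```java
import java.util.List;

public class PartitionSkipDemo {
    enum Check { OK, PARTITION_SKIPPED, PRIORITY_SKIPPED }

    /**
     * Mimics the allocation loop: PARTITION_SKIPPED aborts the whole node set,
     * while other skip results only move on to the next node.
     * Returns how many nodes were actually examined.
     */
    static int scan(List<Check> nodeResults) {
        int examined = 0;
        for (Check result : nodeResults) {
            examined++;
            if (result == Check.PARTITION_SKIPPED) {
                return examined; // give up on this partition entirely
            }
            // PRIORITY_SKIPPED (or OK) falls through to the next node
        }
        return examined;
    }
}
```

This is why returning on PARTITION_SKIPPED avoids re-examining every remaining node in a partition that cannot match.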

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch, YARN-10589-002.patch, 
> YARN-10589-003.patch
>
>
> {code:java}
> for (String partition : partitions) {
>   if (current++ > start) {
>     break;
>   }
>   CandidateNodeSet<FiCaSchedulerNode> candidates =
>       cs.getCandidateNodeSet(partition);
>   if (candidates == null) {
>     continue;
>   }
>   cs.allocateContainersToNode(candidates, false);
> }
> {code}
> In the above logic, if we have thousands of nodes in one partition, we will 
> still repeatedly scan all nodes of the partition thousands of times. There is 
> no early exit: if the partition does not match for the first node, we should 
> stop checking the other nodes in that partition.






[jira] [Created] (YARN-10608) Extend yarn.nodemanager.delete.debug-delay-sec to support application level.

2021-02-02 Thread Qi Zhu (Jira)
Qi Zhu created YARN-10608:
-

 Summary: Extend yarn.nodemanager.delete.debug-delay-sec to support 
application level.
 Key: YARN-10608
 URL: https://issues.apache.org/jira/browse/YARN-10608
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: log-aggregation
Reporter: Qi Zhu
Assignee: Qi Zhu


Currently, yarn.nodemanager.delete.debug-delay-sec is a cluster-level setting.

In our production cluster, we set it to 0 by default to prevent logs from 
piling up.

But when we need to dig into errors in Spark/MR and other jobs, I suggest 
supporting an application-level setting that delays deletion of the local logs 
so the error can be reproduced.

  






[jira] [Commented] (YARN-10589) Improve logic of multi-node allocation

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277031#comment-17277031
 ] 

Qi Zhu commented on YARN-10589:
---

[~tanu.ajmera]

I agree with [~ztang] that we need to split out the partition-related logic and 
return early:
{code:java}
public boolean precheckNode(SchedulerRequestKey schedulerKey,
    SchedulerNode schedulerNode, SchedulingMode schedulingMode,
    Optional<DiagnosticsCollector> dcOpt) {
  this.readLock.lock();
  try {
    AppPlacementAllocator<SchedulerNode> ap =
        schedulerKeyToAppPlacementAllocator.get(schedulerKey);
    return (ap != null) && (ap.getPlacementAttempt() < retryAttempts) &&
        ap.precheckNode(schedulerNode, schedulingMode, dcOpt);
  } finally {
this.readLock.unlock();
  }
}
{code}

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch, YARN-10589-002.patch, 
> YARN-10589-003.patch
>
>
> {code:java}
> for (String partition : partitions) {
>   if (current++ > start) {
>     break;
>   }
>   CandidateNodeSet<FiCaSchedulerNode> candidates =
>       cs.getCandidateNodeSet(partition);
>   if (candidates == null) {
>     continue;
>   }
>   cs.allocateContainersToNode(candidates, false);
> }
> {code}
> In the above logic, if we have thousands of nodes in one partition, we will 
> still repeatedly scan all nodes of the partition thousands of times. There is 
> no early exit: if the partition does not match for the first node, we should 
> stop checking the other nodes in that partition.






[jira] [Updated] (YARN-10589) Improve logic of multi-node allocation

2021-02-02 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10589:
--
Attachment: YARN-10589-003.patch

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch, YARN-10589-002.patch, 
> YARN-10589-003.patch
>
>
> {code:java}
> for (String partition : partitions) {
>   if (current++ > start) {
>     break;
>   }
>   CandidateNodeSet<FiCaSchedulerNode> candidates =
>       cs.getCandidateNodeSet(partition);
>   if (candidates == null) {
>     continue;
>   }
>   cs.allocateContainersToNode(candidates, false);
> }
> {code}
> In the above logic, if we have thousands of nodes in one partition, we will 
> still repeatedly scan all nodes of the partition thousands of times. There is 
> no early exit: if the partition does not match for the first node, we should 
> stop checking the other nodes in that partition.






[jira] [Commented] (YARN-10611) Fix that shaded should be used for google guava imports in YARN-10352.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278478#comment-17278478
 ] 

Qi Zhu commented on YARN-10611:
---

Thanks for the reply, [~ahussein].

The findbugs warnings will be fixed in YARN-10612.

The TestDelegationTokenRenewer failure is unrelated; it will be fixed in 
YARN-10500.

Thanks.

> Fix that shaded should be used for google guava imports in YARN-10352.
> --
>
> Key: YARN-10611
> URL: https://issues.apache.org/jira/browse/YARN-10611
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10611.001.patch
>
>
> Fix that shaded should be used for google guava imports in YARN-10352.






[jira] [Comment Edited] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278079#comment-17278079
 ] 

Qi Zhu edited comment on YARN-10610 at 2/4/21, 3:40 AM:


 [~snemeth]  [~shuzirra] 

The findbugs warning is not related to this change. Should the checkstyle 
warning be addressed, or should we just stay consistent with the original 
queueName field?

Do you have any other thoughts?

Thanks.

 


was (Author: zhuqi):
 [~snemeth]  [~shuzirra]

The findbugs warning is not related to this change, and I think the checkstyle 
warning should not be addressed, to stay consistent with the original queueName 
field.

Could you help review this for merge?

Thanks.

 

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17274163#comment-17274163
 ] 

Qi Zhu edited comment on YARN-10178 at 2/4/21, 3:27 AM:


[~wangda] [~bteke]

I have updated the patch to sort PriorityQueueResourcesForSorting and added a 
reference back to the queue.

I also added tests to guard against side effects/regressions.

After performance testing, there seems to be no performance cost:

The mock performance test covers two cases: mock 1000 queues and mock 1 
queues.

1. I was surprised that with a queue size of 1000, the new structure sorts 
faster than the old queue sort; the gap is less than 1s.

2. With a queue size of 1, the old queue sort is faster than the new structure 
sort, but the gap is always less than 10s.

Do you have any thoughts about this?

Thanks a lot.
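The idea of sorting over an immutable snapshot, so that a concurrently updated usage value cannot make the comparator contradict itself mid-sort, can be sketched as below; the class names only loosely mirror PriorityQueueResourcesForSorting, and this is an illustration under assumed names, not the actual patch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SnapshotSortDemo {
    /** A queue whose used capacity can change while the scheduler runs. */
    static class MutableQueue {
        final String name;
        volatile double usedCapacity;
        MutableQueue(String name, double used) { this.name = name; this.usedCapacity = used; }
    }

    /** Immutable view captured once, so every comparison in one sort sees the same value. */
    static class QueueSnapshot {
        final MutableQueue queue;   // keep a reference back to the live queue
        final double usedCapacity;  // frozen at snapshot time
        QueueSnapshot(MutableQueue q) { this.queue = q; this.usedCapacity = q.usedCapacity; }
    }

    /** Snapshot all queues first, then sort the snapshots; TimSort's contract holds. */
    static List<String> sortedNames(List<MutableQueue> queues) {
        List<QueueSnapshot> snapshots = new ArrayList<>();
        for (MutableQueue q : queues) {
            snapshots.add(new QueueSnapshot(q));
        }
        snapshots.sort(Comparator.comparingDouble(s -> s.usedCapacity));
        List<String> names = new ArrayList<>();
        for (QueueSnapshot s : snapshots) {
            names.add(s.queue.name);
        }
        return names;
    }
}
```

Because the sort never reads the live usedCapacity, the comparator stays self-consistent even if another thread updates a queue during the sort.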

 


was (Author: zhuqi):
[~wangda] [~bteke]

I have updated the patch to sort PriorityQueueResourcesForSorting and added a 
reference back to the queue.

I also added tests to guard against side effects/regressions.

After performance testing, there seems to be no performance cost:

1. I was surprised that with a queue size of about 1000, the new sort is 
faster than the old one.

2. When the queue size is huge (1), the old sort is faster than the new one, 
but the gap is always less than 10s.

Do you have any thoughts about this?

Thanks a lot.

 

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort has a 
> few requirements of the comparator:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z  -->  x > z
> 3. x == y  -->  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array's elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see the Capacity Scheduler compares queues on these resource usages:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In Capacity 

[jira] [Created] (YARN-10609) Update the document for YARN-10531(Be able to disable user limit factor for CapacityScheduler Leaf Queue)

2021-02-02 Thread Qi Zhu (Jira)
Qi Zhu created YARN-10609:
-

 Summary: Update the document for YARN-10531(Be able to disable 
user limit factor for CapacityScheduler Leaf Queue)
 Key: YARN-10609
 URL: https://issues.apache.org/jira/browse/YARN-10609
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Qi Zhu
Assignee: Qi Zhu


Since YARN-10531 is finished, we should update the corresponding documentation.






[jira] [Commented] (YARN-10609) Update the document for YARN-10531(Be able to disable user limit factor for CapacityScheduler Leaf Queue)

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277137#comment-17277137
 ] 

Qi Zhu commented on YARN-10609:
---

cc [~wangda] [~snemeth] [~gandras] [~pbacsko]

Could you review and merge the doc update for YARN-10531 (Be able to disable 
the user limit factor for a CapacityScheduler leaf queue)?

Thanks.
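For context, the setting being documented can presumably be expressed in capacity-scheduler.xml like this, assuming YARN-10531's convention that -1 disables the factor (the queue path `root.default` is only an example; verify against the final patch):

```xml
<property>
  <!-- -1 disables the user limit factor for this leaf queue (YARN-10531),
       letting a single user grow up to the queue's maximum capacity. -->
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>-1</value>
</property>
```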

> Update the document for YARN-10531(Be able to disable user limit factor for 
> CapacityScheduler Leaf Queue)
> -
>
> Key: YARN-10609
> URL: https://issues.apache.org/jira/browse/YARN-10609
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10609.001.patch
>
>
> Since YARN-10531 is finished, we should update the corresponding 
> documentation.






[jira] [Updated] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-02 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10532:
--
Attachment: YARN-10532.010.patch

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Comment Edited] (YARN-10609) Update the document for YARN-10531(Be able to disable user limit factor for CapacityScheduler Leaf Queue)

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277137#comment-17277137
 ] 

Qi Zhu edited comment on YARN-10609 at 2/2/21, 2:19 PM:


cc [~wangda] [~snemeth] [~gandras] [~pbacsko]

Could you help review and merge the doc update for YARN-10531 (Be able to 
disable the user limit factor for a CapacityScheduler leaf queue)?

Thanks.


was (Author: zhuqi):
cc [~wangda] [~snemeth] [~gandras] [~pbacsko]

Could you review and merge the doc update for YARN-10531 (Be able to disable 
the user limit factor for a CapacityScheduler leaf queue)?

Thanks.

> Update the document for YARN-10531(Be able to disable user limit factor for 
> CapacityScheduler Leaf Queue)
> -
>
> Key: YARN-10609
> URL: https://issues.apache.org/jira/browse/YARN-10609
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10609.001.patch
>
>
> Since we have finished YARN-10531.
> We should update the corresponding document.






[jira] [Commented] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277704#comment-17277704
 ] 

Qi Zhu commented on YARN-10610:
---

1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path the way FS does.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed.

I think we should support it in CS.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: image-2021-02-03-13-47-13-516.png
>
>
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Comment Edited] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277704#comment-17277704
 ] 

Qi Zhu edited comment on YARN-10610 at 2/3/21, 5:53 AM:


1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path the way FS does.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed.

I think we should support it in CS.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]


was (Author: zhuqi):
1. In our product cluster, we want to migrate fs to cs, but the cs restful api 
don't have queue path consistent with cs.

2. Now, the cs support same name leaf queue, so this is also needed for full 
queuePath. 

I think we should support it in cs.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: image-2021-02-03-13-47-13-516.png
>
>
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Created] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-02 Thread Qi Zhu (Jira)
Qi Zhu created YARN-10610:
-

 Summary: Add queuePath to restful api for CapacityScheduler 
consistent with FairScheduler queuePath.
 Key: YARN-10610
 URL: https://issues.apache.org/jira/browse/YARN-10610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Qi Zhu
Assignee: Qi Zhu
 Attachments: image-2021-02-03-13-47-13-516.png

!image-2021-02-03-13-47-13-516.png|width=631,height=356!

 

 






[jira] [Updated] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-02 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10532:
--
Attachment: YARN-10532.011.patch

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277660#comment-17277660
 ] 

Qi Zhu commented on YARN-10532:
---

Thanks for the patient review, [~gandras].

I have fixed the issues above.

As for renaming signalToSubmitToQueue to signalSubmission: I did not change it, 
because the parent queue also uses it, and I think the name is reasonable.

I also added a test to check the scheduling of AutoCreatedQueueDeletionPolicy.

I also added support for disabling auto deletion for specific queues.

[~wangda] 

Do you have any other thoughts?

Thanks.
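The deletion policy being discussed can be sketched as a periodic scan over last-use timestamps. The names, the per-queue opt-out list, and the 5-minute default below are only loosely based on the JIRA description, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AutoDeletePolicyDemo {
    static final long DEFAULT_EXPIRE_MS = 5 * 60 * 1000; // "not in use for ~5 mins"

    /**
     * Returns the auto-created queues whose last use is older than the expiry,
     * skipping queues that opted out of auto deletion.
     */
    static List<String> queuesToDelete(Map<String, Long> lastUsedMs,
                                       List<String> autoDeleteDisabled,
                                       long nowMs) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastUsedMs.entrySet()) {
            if (autoDeleteDisabled.contains(e.getKey())) {
                continue; // per-queue opt-out from auto deletion
            }
            if (nowMs - e.getValue() > DEFAULT_EXPIRE_MS) {
                toDelete.add(e.getKey());
            }
        }
        return toDelete;
    }
}
```

A policy shaped like this keeps the queue hierarchy small when most of the 500 auto-created user queues sit idle.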

 

 

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Comment Edited] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-02 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277704#comment-17277704
 ] 

Qi Zhu edited comment on YARN-10610 at 2/3/21, 5:54 AM:


1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path the way FS does.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed in the REST API scheduler info.

I think we should support it in CS.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]


was (Author: zhuqi):
1. In our product cluster, we want to migrate fs to cs, but the cs restful api 
don't have queue path consistent with fs.

2. Now, the cs support same name leaf queue, so this is also needed for full 
queuePath. 

I think we should support it in cs.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: image-2021-02-03-13-47-13-516.png
>
>
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Updated] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-02 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10610:
--
Description: 
The CS only has queueName, not the full queuePath.

!image-2021-02-03-13-47-13-516.png|width=631,height=356!

 

 

  was:
!image-2021-02-03-13-47-13-516.png|width=631,height=356!

 

 


> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Updated] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10532:
--
Attachment: YARN-10532.012.patch

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277789#comment-17277789
 ] 

Qi Zhu commented on YARN-10532:
---

Fixed the javadoc, findbugs, and checkstyle issues in the latest patch.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Updated] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10610:
--
Issue Type: Improvement  (was: Bug)

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Comment Edited] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277704#comment-17277704
 ] 

Qi Zhu edited comment on YARN-10610 at 2/3/21, 9:27 AM:


1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path consistent with FS.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed in the REST API scheduler info.

I think we should support it in CS.

cc [~wangda]  [~tangzhankun] [~shuzirra] [~snemeth] [~pbacsko]
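The full queuePath requested above can be derived by walking from a queue up to root and joining the names. A minimal illustrative sketch under assumed names (the Queue class and fullPath helper here are hypothetical, not the actual CS classes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: compute a FairScheduler-style full queue path
// (e.g. "root.a.b") from a leaf queue by walking its parent pointers.
public class QueuePathSketch {
    static final class Queue {
        final String name;
        final Queue parent;
        Queue(String name, Queue parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    static String fullPath(Queue q) {
        Deque<String> parts = new ArrayDeque<>();
        // Walk up to root, collecting names front-first.
        for (Queue cur = q; cur != null; cur = cur.parent) {
            parts.addFirst(cur.name);
        }
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        Queue root = new Queue("root", null);
        Queue a = new Queue("a", root);
        Queue b = new Queue("b", a);
        System.out.println(fullPath(b)); // root.a.b
    }
}
```

With same-name leaf queues allowed, only a path like this (not the bare leaf name) identifies a queue unambiguously in the REST response.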


was (Author: zhuqi):
1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path consistent with FS.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed in the REST API scheduler info.

I think we should support it in CS.

cc [~wangda]  [~shuzirra] [~snemeth] [~pbacsko]

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Comment Edited] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277704#comment-17277704
 ] 

Qi Zhu edited comment on YARN-10610 at 2/3/21, 9:29 AM:


1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path consistent with FS.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed in the REST API scheduler info.

I think we should support it in CS; submitted a patch for review, thanks.

cc [~wangda]  [~tangzhankun] [~shuzirra] [~snemeth] [~pbacsko]


was (Author: zhuqi):
1. In our production cluster, we want to migrate from FS to CS, but the CS 
RESTful API doesn't expose a queue path consistent with FS.

2. Now that CS supports same-name leaf queues, the full queuePath is also 
needed in the REST API scheduler info.

I think we should support it in CS.

cc [~wangda]  [~tangzhankun] [~shuzirra] [~snemeth] [~pbacsko]

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Commented] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277977#comment-17277977
 ] 

Qi Zhu commented on YARN-10610:
---

Thanks a lot for the review, [~shuzirra].

Fixed the TestRMWebServicesForCSWithPartitions test case in the latest patch.

 

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Updated] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10610:
--
Attachment: YARN-10610.002.patch

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Updated] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10532:
--
Attachment: YARN-10532.013.patch

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Updated] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10532:
--
Attachment: YARN-10532.014.patch

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch, YARN-10532.014.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Commented] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278079#comment-17278079
 ] 

Qi Zhu commented on YARN-10610:
---

 [~snemeth]  [~shuzirra]

The findbugs warning is not related to this change, and I think the checkstyle 
warning should not be fixed here, to stay consistent with the original 
queueName field.

Could you help review this for merge?

Thanks.

 

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> image-2021-02-03-13-47-13-516.png
>
>
> The CS REST API only exposes queueName, not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  






[jira] [Commented] (YARN-10611) Fix that shaded should be used for google guava imports in YARN-10352.

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278132#comment-17278132
 ] 

Qi Zhu commented on YARN-10611:
---

cc [~ahussein]

Fixed the guava import in 
[TestCapacitySchedulerMultiNodes-L#28|https://github.com/apache/hadoop/commit/6fc26ad5392a2a61ace60b88ed931fed3859365d#diff-34d534eb66cd9af6d7c47a9f643d598b1ad4cef3453219457769e92fbd4a649dR28].

Thanks.

> Fix that shaded should be used for google guava imports in YARN-10352.
> --
>
> Key: YARN-10611
> URL: https://issues.apache.org/jira/browse/YARN-10611
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10611.001.patch
>
>
> Fix that shaded should be used for google guava imports in YARN-10352.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278121#comment-17278121
 ] 

Qi Zhu commented on YARN-10352:
---

Thanks for the review, [~ahussein].

I will help to fix this.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Fix For: 3.4.0
>
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the 
> RM, so the RM's active node list will still contain those stopped nodes 
> until the NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 
> minutes, Multi Node Placement assigns containers to those nodes. The 
> scheduler needs to exclude nodes which have not heartbeated within the 
> configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar 
> to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are 
> assigned to the stopped NM worker0.
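The heartbeat check referenced in the description (CapacityScheduler#shouldSkipNodeSchedule) boils down to comparing a node's last heartbeat timestamp against the configured interval. A simplified sketch of that idea, not the actual implementation:

```java
// Simplified sketch of the staleness test behind
// CapacityScheduler#shouldSkipNodeSchedule: skip scheduling on a node whose
// last heartbeat is older than the configured heartbeat interval.
public class HeartbeatSkipSketch {
    static boolean shouldSkipNodeSchedule(long lastHeartbeatMillis,
                                          long nowMillis,
                                          long heartbeatIntervalMillis) {
        return nowMillis - lastHeartbeatMillis > heartbeatIntervalMillis;
    }

    public static void main(String[] args) {
        // 1000 ms mirrors yarn.resourcemanager.nodemanagers.heartbeat-interval-ms.
        long interval = 1000L;
        // Stale node (no heartbeat for 1500 ms) -> skip it.
        System.out.println(shouldSkipNodeSchedule(0L, 1500L, interval));
        // Fresh node (heartbeated 500 ms ago) -> schedule on it.
        System.out.println(shouldSkipNodeSchedule(1000L, 1500L, interval));
    }
}
```

This is why a stopped-but-unregistered NM stops receiving containers as soon as its heartbeats go stale, instead of only after the 10-minute liveness expiry.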






[jira] [Created] (YARN-10611) Fix that shaded should be used for google guava imports in YARN-10352.

2021-02-03 Thread Qi Zhu (Jira)
Qi Zhu created YARN-10611:
-

 Summary: Fix that shaded should be used for google guava imports 
in YARN-10352.
 Key: YARN-10611
 URL: https://issues.apache.org/jira/browse/YARN-10611
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Qi Zhu
Assignee: Qi Zhu









[jira] [Updated] (YARN-10611) Fix that shaded should be used for google guava imports in YARN-10352.

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10611:
--
Description: Fix that shaded should be used for google guava imports in 
YARN-10352.  (was: Fix )

> Fix that shaded should be used for google guava imports in YARN-10352.
> --
>
> Key: YARN-10611
> URL: https://issues.apache.org/jira/browse/YARN-10611
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> Fix that shaded should be used for google guava imports in YARN-10352.






[jira] [Updated] (YARN-10611) Fix that shaded should be used for google guava imports in YARN-10352.

2021-02-03 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10611:
--
Description: Fix 

> Fix that shaded should be used for google guava imports in YARN-10352.
> --
>
> Key: YARN-10611
> URL: https://issues.apache.org/jira/browse/YARN-10611
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> Fix 






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-02-03 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278134#comment-17278134
 ] 

Qi Zhu commented on YARN-10352:
---

[~ahussein]

Fixed it in YARN-10611.

Thanks.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Fix For: 3.4.0
>
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the 
> RM, so the RM's active node list will still contain those stopped nodes 
> until the NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 
> minutes, Multi Node Placement assigns containers to those nodes. The 
> scheduler needs to exclude nodes which have not heartbeated within the 
> configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar 
> to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are 
> assigned to the stopped NM worker0.






[jira] [Commented] (YARN-10807) Parents node labels are incorrectly added to child queues in weight mode

2021-06-07 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358980#comment-17358980
 ] 

Qi Zhu commented on YARN-10807:
---

Thanks [~bteke] for the update.

The patch LGTM.

 

> Parents node labels are incorrectly added to child queues in weight mode 
> -
>
> Key: YARN-10807
> URL: https://issues.apache.org/jira/browse/YARN-10807
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10807.001.patch, YARN-10807.002.patch
>
>
> In ParentQueue.updateClusterResource, when calculating the normalized 
> weights, CS iterates through the parent's node labels. If the parent has a 
> node label that a specific child doesn't, that label will incorrectly be 
> added to the child's node label list through the 
> queueCapacities.setNormalizedWeight(label, weight) call:
> {code:java}
> // Normalize weight of children
>   if (getCapacityConfigurationTypeForQueues(childQueues)
>   == QueueCapacityType.WEIGHT) {
> for (String nodeLabel : queueCapacities.getExistingNodeLabels()) {
>   float sumOfWeight = 0;
>   for (CSQueue queue : childQueues) {
> float weight = Math.max(0,
> queue.getQueueCapacities().getWeight(nodeLabel));
> sumOfWeight += weight;
>   }
>   // When sum of weight == 0, skip setting normalized_weight (so
>   // normalized weight will be 0).
>   if (Math.abs(sumOfWeight) > 1e-6) {
> for (CSQueue queue : childQueues) {
> queue.getQueueCapacities().setNormalizedWeight(nodeLabel,
> queue.getQueueCapacities().getWeight(nodeLabel) / 
> sumOfWeight);
> }
>   }
> }
>   }
> {code}
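One fix direction for the bug quoted above is to write a normalized weight only for labels the child queue actually has. A self-contained sketch of that idea on plain maps (the normalize helper and its shapes are hypothetical, not the actual ParentQueue code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the fix direction: normalize child weights per node
// label, but only write a normalized weight for a label the child actually
// has, so the parent's labels are not implicitly added to every child.
public class NormalizeSketch {
    // weightsByChild: child name -> (node label -> configured weight)
    static Map<String, Map<String, Float>> normalize(
            Map<String, Map<String, Float>> weightsByChild,
            Set<String> parentLabels) {
        Map<String, Map<String, Float>> out = new HashMap<>();
        for (String label : parentLabels) {
            float sum = 0f;
            for (Map<String, Float> w : weightsByChild.values()) {
                sum += Math.max(0f, w.getOrDefault(label, 0f));
            }
            // When sum of weight == 0, skip (normalized weight stays 0).
            if (Math.abs(sum) <= 1e-6) {
                continue;
            }
            for (Map.Entry<String, Map<String, Float>> e
                    : weightsByChild.entrySet()) {
                Map<String, Float> w = e.getValue();
                if (!w.containsKey(label)) {
                    continue; // guard: don't add the parent's label to a child
                }
                out.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                        .put(label, w.get(label) / sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Float>> weights = new HashMap<>();
        weights.put("a", new HashMap<>(Map.of("", 1f, "gpu", 3f)));
        weights.put("b", new HashMap<>(Map.of("", 1f))); // b has no "gpu"
        Map<String, Map<String, Float>> norm =
                normalize(weights, Set.of("", "gpu"));
        System.out.println(norm.get("a").get("gpu"));        // 1.0
        System.out.println(norm.get("b").containsKey("gpu")); // false
    }
}
```

The extra containsKey guard is the difference from the quoted snippet: without it, setting a normalized weight for "gpu" on child "b" would add the label to b's capacities as a side effect.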






[jira] [Commented] (YARN-10801) Fix Auto Queue template to properly set all configuration properties

2021-06-09 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360017#comment-17360017
 ] 

Qi Zhu commented on YARN-10801:
---

Thanks [~gandras] for the update.

The latest patch LGTM.

> Fix Auto Queue template to properly set all configuration properties
> 
>
> Key: YARN-10801
> URL: https://issues.apache.org/jira/browse/YARN-10801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10801.001.patch, YARN-10801.002.patch, 
> YARN-10801.003.patch, YARN-10801.004.patch
>
>
> Currently Auto Queue templates set configuration properties only on the 
> Configuration object passed in the constructor. Because many configuration 
> values are read from the Configuration object in csContext, template 
> properties are not set in every case.






[jira] [Commented] (YARN-10657) We should make max application per queue to support node label.

2021-06-09 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360109#comment-17360109
 ] 

Qi Zhu commented on YARN-10657:
---

[~gandras] Of course you can take it, and I will help review. :D

Assigned it to you.

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should look deeper into label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.
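The suggestion above can be sketched as taking the largest absolute capacity across a queue's labels when deriving its max applications. An illustrative sketch with hypothetical names (maxAppsForQueue is not an actual CS method):

```java
// Illustrative sketch: when a queue has node labels, derive its max
// applications from the largest absolute capacity across all of its labels,
// instead of only from the default partition's capacity.
public class MaxAppsSketch {
    static int maxAppsForQueue(int systemMaxApps,
                               float[] absCapacityPerLabel) {
        float maxCap = 0f;
        for (float c : absCapacityPerLabel) {
            maxCap = Math.max(maxCap, c);
        }
        return (int) (systemMaxApps * maxCap);
    }

    public static void main(String[] args) {
        // Default partition capacity 10%, "gpu" label capacity 50%:
        // the label capacity dominates, so the queue gets the larger limit.
        System.out.println(
                maxAppsForQueue(10000, new float[] {0.1f, 0.5f})); // 5000
    }
}
```

Without considering labels, the same queue would be capped at 10000 * 0.1 = 1000 applications even though half the labeled capacity belongs to it.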






[jira] [Assigned] (YARN-10657) We should make max application per queue to support node label.

2021-06-09 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned YARN-10657:
-

Assignee: Andras Gyori

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should look deeper into label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.






[jira] [Assigned] (YARN-10657) We should make max application per queue to support node label.

2021-06-09 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned YARN-10657:
-

Assignee: (was: Qi Zhu)

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should look deeper into label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.






[jira] [Commented] (YARN-10807) Parents node labels are incorrectly added to child queues in weight mode

2021-06-08 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359343#comment-17359343
 ] 

Qi Zhu commented on YARN-10807:
---

Thanks [~bteke] for the patch and [~gandras] for the review.

Committed to trunk.

> Parents node labels are incorrectly added to child queues in weight mode 
> -
>
> Key: YARN-10807
> URL: https://issues.apache.org/jira/browse/YARN-10807
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10807.001.patch, YARN-10807.002.patch
>
>
> In ParentQueue.updateClusterResource, when calculating the normalized 
> weights, CS iterates through the parent's node labels. If the parent has a 
> node label that a specific child doesn't, that label will incorrectly be 
> added to the child's node label list through the 
> queueCapacities.setNormalizedWeight(label, weight) call:
> {code:java}
> // Normalize weight of children
>   if (getCapacityConfigurationTypeForQueues(childQueues)
>   == QueueCapacityType.WEIGHT) {
> for (String nodeLabel : queueCapacities.getExistingNodeLabels()) {
>   float sumOfWeight = 0;
>   for (CSQueue queue : childQueues) {
> float weight = Math.max(0,
> queue.getQueueCapacities().getWeight(nodeLabel));
> sumOfWeight += weight;
>   }
>   // When sum of weight == 0, skip setting normalized_weight (so
>   // normalized weight will be 0).
>   if (Math.abs(sumOfWeight) > 1e-6) {
> for (CSQueue queue : childQueues) {
> queue.getQueueCapacities().setNormalizedWeight(nodeLabel,
> queue.getQueueCapacities().getWeight(nodeLabel) / 
> sumOfWeight);
> }
>   }
> }
>   }
> {code}






[jira] [Comment Edited] (YARN-10801) Fix Auto Queue template to properly set all configuration properties

2021-06-08 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359366#comment-17359366
 ] 

Qi Zhu edited comment on YARN-10801 at 6/8/21, 1:42 PM:


Thanks [~gandras] for the patch, LGTM now.

I have a question about the code: should we also set 
MaximumApplicationMasterResourcePerQueuePercent to 100%, since the user limit 
factor is already unlimited?

Any other value such as 0.5, 0.6, or 0.7 is arbitrary; we can't define an 
accurate value. What do you think about this?
cc [~bteke] [~gandras] 
{code:java}
if (isDynamicQueue()) {
  // set to -1, to disable it
  configuration.setUserLimitFactor(getQueuePath(), -1);
  // Set Max AM percentage to a higher value
  configuration.setMaximumApplicationMasterResourcePerQueuePercent(
  getQueuePath(), 0.5f);
}
{code}
Thanks.


was (Author: zhuqi):
Thanks [~gandras] for the patch, LGTM now.

I have a question about the code: should we also set 
MaximumApplicationMasterResourcePerQueuePercent to 100%, since the user limit 
factor is already unlimited?
{code:java}
if (isDynamicQueue()) {
  // set to -1, to disable it
  configuration.setUserLimitFactor(getQueuePath(), -1);
  // Set Max AM percentage to a higher value
  configuration.setMaximumApplicationMasterResourcePerQueuePercent(
  getQueuePath(), 0.5f);
}
{code}
Thanks.

> Fix Auto Queue template to properly set all configuration properties
> 
>
> Key: YARN-10801
> URL: https://issues.apache.org/jira/browse/YARN-10801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10801.001.patch, YARN-10801.002.patch, 
> YARN-10801.003.patch
>
>
> Currently Auto Queue templates set configuration properties only on the 
> Configuration object passed in the constructor. Because many configuration 
> values are read from the Configuration object in csContext, template 
> properties are not set in every case.






[jira] [Commented] (YARN-10801) Fix Auto Queue template to properly set all configuration properties

2021-06-08 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359366#comment-17359366
 ] 

Qi Zhu commented on YARN-10801:
---

Thanks [~gandras] for the patch, LGTM now.

I have a question about the code: should we also set 
MaximumApplicationMasterResourcePerQueuePercent to 100%, since the user limit 
factor is already unlimited?
{code:java}
if (isDynamicQueue()) {
  // set to -1, to disable it
  configuration.setUserLimitFactor(getQueuePath(), -1);
  // Set Max AM percentage to a higher value
  configuration.setMaximumApplicationMasterResourcePerQueuePercent(
  getQueuePath(), 0.5f);
}
{code}
Thanks.

> Fix Auto Queue template to properly set all configuration properties
> 
>
> Key: YARN-10801
> URL: https://issues.apache.org/jira/browse/YARN-10801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10801.001.patch, YARN-10801.002.patch, 
> YARN-10801.003.patch
>
>
> Currently Auto Queue templates set configuration properties only on the 
> Configuration object passed in the constructor. Because many configuration 
> values are read from the Configuration object in csContext, template 
> properties are not set in every case.






[jira] [Updated] (YARN-10632) Make maximum depth allowed configurable.

2021-05-13 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10632:
--
Attachment: YARN-10632.004.patch

> Make maximum depth allowed configurable.
> 
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the maximum allowed depth is fixed at 2, but I think this should be 
> configurable.
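If this limit were made configurable, it could be exposed as a capacity-scheduler.xml property along the following lines. The property name below is hypothetical, for illustration only; the actual name is defined by the patch.

```xml
<property>
  <!-- Hypothetical property name, shown only to illustrate the idea -->
  <name>yarn.scheduler.capacity.max-auto-created-queue-depth</name>
  <value>3</value>
</property>
```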






[jira] [Commented] (YARN-10632) Make maximum depth allowed configurable.

2021-05-13 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343950#comment-17343950
 ] 

Qi Zhu commented on YARN-10632:
---

Fixed the checkstyle and javadoc issues in the latest patch.

> Make maximum depth allowed configurable.
> 
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the maximum allowed depth is fixed at 2, but I think this should be 
> configurable.






[jira] [Commented] (YARN-10759) Encapsulate queue config modes

2021-05-10 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342269#comment-17342269
 ] 

Qi Zhu commented on YARN-10759:
---

Thanks [~gandras] for this work.

Very nice work; I really appreciate that it makes the code clearer.

LGTM +1. Just fix the checkstyle issues.

> Encapsulate queue config modes
> --
>
> Key: YARN-10759
> URL: https://issues.apache.org/jira/browse/YARN-10759
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10759.001.patch, YARN-10759.002.patch, 
> YARN-10759.003.patch
>
>
> Capacity Scheduler queues have three modes:
>  * relative/percentage
>  * weight
>  * absolute
> Most of them have their own:
>  * validation logic
>  * config setting logic
>  * effective capacity calculation logic
> These logics can be easily extracted and encapsulated in separate config mode 
> classes. 
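The encapsulation described above can be sketched as a small strategy hierarchy. All names here are hypothetical illustrations of the idea, not the actual patch classes.

```java
// Illustrative sketch: extract per-mode validation and effective-capacity
// logic into separate strategy classes. Names are hypothetical.
interface QueueCapacityConfigMode {
    boolean validate(float configuredValue);            // validation logic
    float effectiveCapacity(float parentCapacity, float configuredValue);
}

class PercentageMode implements QueueCapacityConfigMode {
    public boolean validate(float v) { return v >= 0f && v <= 100f; }
    public float effectiveCapacity(float parent, float v) {
        return parent * v / 100f;                       // percentage of parent
    }
}

class WeightMode implements QueueCapacityConfigMode {
    private final float sumOfSiblingWeights;
    WeightMode(float sumOfSiblingWeights) { this.sumOfSiblingWeights = sumOfSiblingWeights; }
    public boolean validate(float v) { return v > 0f; }
    public float effectiveCapacity(float parent, float v) {
        return parent * v / sumOfSiblingWeights;        // weight share of parent
    }
}

public class ConfigModeSketch {
    public static void main(String[] args) {
        QueueCapacityConfigMode pct = new PercentageMode();
        QueueCapacityConfigMode weight = new WeightMode(4f);
        System.out.println(pct.effectiveCapacity(100f, 50f));    // 50.0
        System.out.println(weight.effectiveCapacity(100f, 1f));  // 25.0
    }
}
```

The queue then holds one mode object and delegates, instead of branching on the mode in every method.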






[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

2021-05-12 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343084#comment-17343084
 ] 

Qi Zhu commented on YARN-10517:
---

Thanks [~zhanqi.cai] for confirming.

cc [~pbacsko]  [~ebadger] [~epayne]

> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> --
>
> Key: YARN-10517
> URL: https://issues.apache.org/jira/browse/YARN-10517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0, 3.3.0
>Reporter: sibyl.lv
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10517-branch-3.2.001.patch, YARN-10517.001.patch, 
> wrong metrics.png
>
>
> After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still has 
> incorrect allocated jmx, such as  {color:#660e7a}allocatedMB, 
> {color}{color:#660e7a}allocatedVCores and 
> {color}{color:#660e7a}allocatedContainers, {color}when the node partition is 
> updated from "DEFAULT" to another label and there are running applications.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Submit one application to default partition and run
>  # Add label "tpcds" to cluster and replace label on node1 and node2 to be 
> "tpcds" when the above application is running
>  # Note down "VCores Used" at Web UI
>  # When the application is finished, the metrics get wrong (screenshots 
> attached).
> ==
>  
> FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles 
> the {color:#660e7a}NODE_LABELS_UPDATE{color} event.
> So we should release the container resource from the old partition and add the 
> used resource to the new partition, just as queueUsage is updated.
> {code:java}
> // code placeholder
> public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
> String newPartition) {
>   Resource containerResource = rmContainer.getAllocatedResource();
>   this.attemptResourceUsage.decUsed(oldPartition, containerResource);
>   this.attemptResourceUsage.incUsed(newPartition, containerResource);
>   getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
>   getCSLeafQueue().incUsedResource(newPartition, containerResource, this);
>   // Update new partition name if container is AM and also update AM resource
>   if (rmContainer.isAMContainer()) {
> setAppAMNodePartitionName(newPartition);
> this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
> this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
> getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
> getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
>   }
> }
> {code}






[jira] [Commented] (YARN-10764) Add rm dispatcher event metrics in SLS

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344490#comment-17344490
 ] 

Qi Zhu commented on YARN-10764:
---

I think we should add the event-related metrics to SLS, such as:
 # The event queue size.
 # The average consuming time of every event type, etc.

cc [~snemeth]

You are the expert on SLS; what's your opinion about this?

Thanks a lot.
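The two metrics suggested above can be sketched with a minimal per-event-type accumulator. This is illustrative only; SLS and the RM actually expose metrics through Hadoop's metrics2 framework, and the class below is a hypothetical stand-in.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch: track per-event-type count and average handling time.
public class EventTypeMetrics {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    // Called by the dispatcher after handling one event.
    public void record(String eventType, long elapsedNanos) {
        counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
        totalNanos.computeIfAbsent(eventType, k -> new LongAdder()).add(elapsedNanos);
    }

    // Average handling time in milliseconds for one event type.
    public double avgMillis(String eventType) {
        long n = counts.getOrDefault(eventType, new LongAdder()).sum();
        return n == 0 ? 0.0 : totalNanos.get(eventType).sum() / 1_000_000.0 / n;
    }

    public static void main(String[] args) {
        EventTypeMetrics m = new EventTypeMetrics();
        m.record("NODE_UPDATE", 2_000_000);  // 2 ms
        m.record("NODE_UPDATE", 4_000_000);  // 4 ms
        System.out.println(m.avgMillis("NODE_UPDATE"));  // 3.0
    }
}
```

The event queue size itself would come from the dispatcher's queue (e.g. its `size()`), sampled periodically alongside these averages.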

> Add rm dispatcher event metrics in SLS 
> ---
>
> Key: YARN-10764
> URL: https://issues.apache.org/jira/browse/YARN-10764
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, scheduler-load-simulator
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> We should use SLS to confirm whether we can get a performance improvement in 
> event consuming time, etc.






[jira] [Commented] (YARN-10737) Fix typos in CapacityScheduler#schedule.

2021-05-13 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344319#comment-17344319
 ] 

Qi Zhu commented on YARN-10737:
---

Thanks [~hexiaoqiao]  [@fdalsotto|https://github.com/fdalsotto] for review.
Merged to trunk.

> Fix typos in CapacityScheduler#schedule.
> 
>
> Key: YARN-10737
> URL: https://issues.apache.org/jira/browse/YARN-10737
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>







[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344372#comment-17344372
 ] 

Qi Zhu commented on YARN-10761:
---

Thanks [~ebadger] [~gandras]  [~chaosju] for review.

Merged to trunk.

 

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> Since YARN-9615  add NodesListManagerEventType to event metrics.
> And we'd better add total 4 busy event type to the metrics according to 
> YARN-9927.






[jira] [Assigned] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-14 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned YARN-10324:
-

Assignee: Yao Guangdong

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch
>
>
>  As the cluster size grows bigger and bigger, the time a Reduce task spends 
> fetching Map results from the NodeManager gets longer and longer. We often see 
> WARN logs like the following in the reducer's logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
>  We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis is that 20,000 maps and 1,000 reduces make the NodeManager perform 
> 20 million IO stream operations in the shuffle phase. When the amount of data 
> each reduce fetches from the map output files is very small, disk IO 
> utilization becomes very high in a big cluster, read timeouts happen 
> frequently, and application completion time grows.
> We found that ShuffleHandler already has an IndexCache for the file.out.index 
> file, so we wanted to turn the many small IOs into fewer big IOs. We tried 
> caching all of the small file data (file.out) in memory when the first fetch 
> request arrives; subsequent fetch requests then only need to read from 
> memory, avoiding disk IO. After caching the data in memory, the read 
> timeouts disappeared.
>  
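The caching idea in this description, populating an in-memory copy of a small file on the first fetch so later fetches avoid small disk reads, can be sketched as below. This is a simplified, hypothetical illustration, not the actual ShuffleHandler change.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: cache small map-output files in memory on first fetch.
public class SmallFileCache {
    private final ConcurrentHashMap<Path, byte[]> cache = new ConcurrentHashMap<>();
    private final long maxFileSize;   // only files up to this size are cached

    public SmallFileCache(long maxFileSize) { this.maxFileSize = maxFileSize; }

    public byte[] read(Path file) throws IOException {
        if (Files.size(file) > maxFileSize) {
            return Files.readAllBytes(file);   // too big: read from disk directly
        }
        return cache.computeIfAbsent(file, f -> {
            try {
                return Files.readAllBytes(f);  // first fetch populates the cache
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("file", ".out");
        Files.write(tmp, new byte[]{1, 2, 3});
        SmallFileCache cache = new SmallFileCache(1024);
        System.out.println(cache.read(tmp).length);  // 3
    }
}
```

A production version would also need eviction and a bound on total cached bytes, which this sketch omits.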






[jira] [Commented] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344478#comment-17344478
 ] 

Qi Zhu commented on YARN-10324:
---

Hi [~yaoguangdong] 

Thanks for this work. I have added you to the contributor list.

You can submit the latest patch to trigger Jenkins.

 

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch
>
>
>  As the cluster size grows bigger and bigger, the time a Reduce task spends 
> fetching Map results from the NodeManager gets longer and longer. We often see 
> WARN logs like the following in the reducer's logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
>  We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis is that 20,000 maps and 1,000 reduces make the NodeManager perform 
> 20 million IO stream operations in the shuffle phase. When the amount of data 
> each reduce fetches from the map output files is very small, disk IO 
> utilization becomes very high in a big cluster, read timeouts happen 
> frequently, and application completion time grows.
> We found that ShuffleHandler already has an IndexCache for the file.out.index 
> file, so we wanted to turn the many small IOs into fewer big IOs. We tried 
> caching all of the small file data (file.out) in memory when the first fetch 
> request arrives; subsequent fetch requests then only need to read from 
> memory, avoiding disk IO. After caching the data in memory, the read 
> timeouts disappeared.
>  






[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344483#comment-17344483
 ] 

Qi Zhu commented on YARN-10761:
---

Thanks [~snemeth] for the reminder.

Sorry about the commit.
YARN-9615 was contributed by me, so I committed this small related change.

I will wait for other committers to review and commit next time.

 

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> Since YARN-9615  add NodesListManagerEventType to event metrics.
> And we'd better add total 4 busy event type to the metrics according to 
> YARN-9927.






[jira] [Comment Edited] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344483#comment-17344483
 ] 

Qi Zhu edited comment on YARN-10761 at 5/14/21, 9:26 AM:
-

Thanks [~snemeth] for the reminder.

[~snemeth] [~ebadger] Sorry about the commit; this is the first time I have 
committed since becoming a committer.

YARN-9615 was contributed by me, so I committed this small related change.

Next time I will wait for other committers to review and commit (with more than 
two +1s); I will learn from you and be a strict committer.

Thanks again.

 


was (Author: zhuqi):
Thanks [~snemeth] for the reminder.

Sorry about the commit; this is the first time I have committed since becoming 
a committer.
 YARN-9615 was contributed by me, so I committed this small related change.

Next time I will wait for other committers to review and commit (with more than 
two +1s); I will learn from you and be a strict committer.

Thanks again.

 

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> Since YARN-9615  add NodesListManagerEventType to event metrics.
> And we'd better add total 4 busy event type to the metrics according to 
> YARN-9927.






[jira] [Commented] (YARN-10766) [UI2] Bump moment-timezone to 0.5.33

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344462#comment-17344462
 ] 

Qi Zhu commented on YARN-10766:
---

Thanks [~gandras] for the patch.

LGTM +1

> [UI2] Bump moment-timezone to 0.5.33
> 
>
> Key: YARN-10766
> URL: https://issues.apache.org/jira/browse/YARN-10766
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn, yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: UI2_Correct_Timezone_After_Bump.png, 
> UI2_Wrong_Timezone_Before_Bump.png, YARN-10766.001.patch
>
>
> A handful of timezone related fixes were added into 0.5.33 release of 
> moment-timezone. An example for a scenario in which current UI2 behaviour is 
> not correct is a user from Australia, where the submission time showed on UI2 
> is one hour ahead of the actual time.
> Unfortunately moment-timezone data range files have been renamed, which is a 
> breaking change from the point of view of emberjs. Including all timezones 
> will increase the overall size of UI2 by an additional ~6 KB. 






[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344531#comment-17344531
 ] 

Qi Zhu commented on YARN-9615:
--

[~chaosju] Sure.:D

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, 
> YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, 
> YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, 
> YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, 
> YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, 
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics

2021-05-16 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345849#comment-17345849
 ] 

Qi Zhu commented on YARN-10763:
---

Thanks [~chaosju] for the update.

The latest patch LGTM +1.

 

> add  the speed of containers assigned metrics to ClusterMetrics
> ---
>
> Key: YARN-10763
> URL: https://issues.apache.org/jira/browse/YARN-10763
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: chaosju
>Assignee: chaosju
>Priority: Minor
> Attachments: YARN-10763.001.patch, YARN-10763.002.patch, 
> YARN-10763.003.patch, YARN-10763.004.patch, YARN-10763.005.patch, 
> YARN-10763.006.patch, YARN-10763.007.patch, YARN-10763.008.patch, 
> screenshot-1.png
>
>
> It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for 
> measuring cluster throughput.
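A containers-assigned-per-second gauge like the one proposed here can be sketched as a counter plus a periodic snapshot. This is illustrative only; ClusterMetrics is actually built on Hadoop's metrics2 framework.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: count container assignments, then compute the rate over the
// interval since the last snapshot.
public class AssignRate {
    private final AtomicLong assigned = new AtomicLong();
    private long lastCount;
    private long lastTimeMs;

    public AssignRate(long nowMs) { this.lastTimeMs = nowMs; }

    // Called by the scheduler each time a container is assigned.
    public void containerAssigned() { assigned.incrementAndGet(); }

    // Called periodically (e.g. once per second) by a metrics snapshot thread.
    public synchronized double snapshotPerSecond(long nowMs) {
        long count = assigned.get();
        long deltaMs = Math.max(1, nowMs - lastTimeMs);
        double rate = (count - lastCount) * 1000.0 / deltaMs;
        lastCount = count;
        lastTimeMs = nowMs;
        return rate;
    }

    public static void main(String[] args) {
        AssignRate r = new AssignRate(0);
        for (int i = 0; i < 50; i++) r.containerAssigned();
        System.out.println(r.snapshotPerSecond(1000));  // 50.0
    }
}
```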






[jira] [Commented] (YARN-10555) missing access check before getAppAttempts

2021-05-17 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345995#comment-17345995
 ] 

Qi Zhu commented on YARN-10555:
---

It was merged by [~aajisaka]; I just marked it fixed.

Thanks [~xiaoheipangzi] for the contribution and [~aajisaka] for the merge.

>  missing access check before getAppAttempts
> ---
>
> Key: YARN-10555
> URL: https://issues.apache.org/jira/browse/YARN-10555
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
>  Labels: pull-request-available, security
> Fix For: 3.4.0
>
> Attachments: YARN-10555_1.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> It seems that we miss a security check before getAppAttempts, see 
> [https://github.com/apache/hadoop/blob/513f1995adc9b73f9c7f4c7beb89725b51b313ac/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L1127]
> thus we can get some sensitive information, such as the logs link.  
> {code:java}
> application_1609318368700_0002 belong to user2
> user1@hadoop11$ curl --negotiate -u  : 
> http://hadoop11:8088/ws/v1/cluster/apps/application_1609318368700_0002/appattempts/|jq
> {
>   "appAttempts": {
> "appAttempt": [
>   {
> "id": 1,
> "startTime": 1609318411566,
> "containerId": "container_1609318368700_0002_01_01",
> "nodeHttpAddress": "hadoop12:8044",
> "nodeId": "hadoop12:36831",
> "logsLink": 
> "http://hadoop12:8044/node/containerlogs/container_1609318368700_0002_01_01/user2;,
> "blacklistedNodes": "",
> "nodesBlacklistedBySystem": ""
>   }
> ]
>   }
> }
> {code}
> Other APIs, like getApps and getApp, have an access check like "hasAccess(app, 
> hsr)"; they hide the logs link if the appid does not belong to the querying 
> user, see 
> [https://github.com/apache/hadoop/blob/513f1995adc9b73f9c7f4c7beb89725b51b313ac/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L1098]
>  We need to add hasAccess(app, hsr) for getAppAttempts.
>  
> Besides, at 
> [https://github.com/apache/hadoop/blob/580a6a75a3e3d3b7918edeffd6e93fc211166884/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java#L145]
> it seems that we have an access check in its caller, so for now I pass "true" 
> to AppAttemptInfo in the patch.  
>  
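The proposed fix, gating sensitive fields on an access check before building the attempt response, can be sketched as follows. `buildView` and `AttemptView` are hypothetical stand-ins for the real RMWebServices.hasAccess check and AppAttemptInfo.

```java
// Sketch: hide the logs link when the querying user lacks access, mirroring
// what getApps/getApp already do. Names here are illustrative only.
public class AttemptAccessSketch {
    static class AttemptView {
        final int id;
        final String logsLink;   // null when the caller lacks access
        AttemptView(int id, String logsLink) { this.id = id; this.logsLink = logsLink; }
    }

    static AttemptView buildView(int id, String logsLink, boolean hasAccess) {
        // Only expose the sensitive field to users who pass the access check.
        return new AttemptView(id, hasAccess ? logsLink : null);
    }

    public static void main(String[] args) {
        AttemptView v = buildView(1, "http://hadoop12:8044/node/containerlogs/...", false);
        System.out.println(v.logsLink);  // null
    }
}
```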






[jira] [Commented] (YARN-8564) Add queue level application lifetime monitor in FairScheduler

2021-05-18 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346911#comment-17346911
 ] 

Qi Zhu commented on YARN-8564:
--

Thanks [~tarunparimi] for the reminder.

I reopened it; you can take it if you are interested in improving this.

> Add queue level application lifetime monitor in FairScheduler 
> --
>
> Key: YARN-8564
> URL: https://issues.apache.org/jira/browse/YARN-8564
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-8564.001.patch, test1~3.jpg, test4.jpg
>
>
> I wish to have a queue-level application lifetime monitor in FairScheduler. 
> In our large YARN cluster, there are sometimes too many small jobs in one 
> minor queue that may run too long, which can affect our high-priority and 
> very important queues. With a queue-level application lifetime monitor, we 
> could set a small lifetime on the minor queue.






[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-18 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347005#comment-17347005
 ] 

Qi Zhu commented on YARN-10771:
---

Thanks [~chaosju] for the update.

The patch LGTM; please go on and fix the checkstyle issues.

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch, YARN-10771.002.patch, 
> YARN-10771.003.patch
>
>
> Add cluster metrics for the size of the Scheduler event queue and the RM event 
> queue. This lets us know the load of the RM and makes monitoring convenient.
>  
>  






[jira] [Reopened] (YARN-8564) Add queue level application lifetime monitor in FairScheduler

2021-05-18 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reopened YARN-8564:
--

> Add queue level application lifetime monitor in FairScheduler 
> --
>
> Key: YARN-8564
> URL: https://issues.apache.org/jira/browse/YARN-8564
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-8564.001.patch, test1~3.jpg, test4.jpg
>
>
> I wish to have a queue-level application lifetime monitor in FairScheduler. 
> In our large YARN cluster, there are sometimes too many small jobs in one 
> minor queue that may run too long, which can affect our high-priority and 
> very important queues. With a queue-level application lifetime monitor, we 
> could set a small lifetime on the minor queue.






[jira] [Comment Edited] (YARN-10774) Federation: Normalize the yarn federation queue name

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347427#comment-17347427
 ] 

Qi Zhu edited comment on YARN-10774 at 5/19/21, 9:01 AM:
-

[~luoyuan] Now FS supports both root.xxx and xxx, but CS still does not support this.

See YARN-10728.

Thanks.


was (Author: zhuqi):
[~luoyuan] Now FS supports both root.xxx and xxx, but CS still does not support this.

> Federation: Normalize the yarn federation queue name
> 
>
> Key: YARN-10774
> URL: https://issues.apache.org/jira/browse/YARN-10774
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, yarn
>Reporter: Yuan LUO
>Priority: Major
> Attachments: YARN-10774.001.patch
>
>
> Since in YARN root.abc is equivalent to the abc queue, the routing 
> behavior of both should be consistent in YARN federation.
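Normalizing the queue name before routing, so that "root.abc" and "abc" take the same path, can be sketched as below. This is an illustrative helper, not the actual Router code.

```java
// Sketch: strip the "root." prefix so both spellings route identically.
public class QueueNameNormalizer {
    private static final String ROOT_PREFIX = "root.";

    public static String normalize(String queue) {
        String q = queue.trim();
        return q.startsWith(ROOT_PREFIX) ? q.substring(ROOT_PREFIX.length()) : q;
    }

    public static void main(String[] args) {
        System.out.println(normalize("root.abc"));  // abc
        System.out.println(normalize("abc"));       // abc
    }
}
```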






[jira] [Commented] (YARN-10701) The yarn.resource-types should support multi types without trimmed.

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347438#comment-17347438
 ] 

Qi Zhu commented on YARN-10701:
---

Thanks [~weichiu] for the reminder.

I will help backport this to branch-3.3.

> The yarn.resource-types should support multi types without trimmed.
> ---
>
> Key: YARN-10701
> URL: https://issues.apache.org/jira/browse/YARN-10701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10701.001.patch, YARN-10701.002.patch
>
>
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu, yarn.io/fpga</value>
> </property>
> {code}
> When I configured the resource types above, with GPU and FPGA, this error
> happened:
>  
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: ' yarn.io/fpga' is 
> not a valid resource name. A valid resource name must begin with a letter and 
> contain only letters, numbers, and any of: '.', '_', or '-'. A valid resource 
> name may also be optionally preceded by a name space followed by a slash. A 
> valid name space consists of period-separated groups of letters, numbers, and 
> dashes.{code}
>   
> The resource type values should be trimmed before validation.






[jira] [Commented] (YARN-10774) Federation: Normalize the yarn federation queue name

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347427#comment-17347427
 ] 

Qi Zhu commented on YARN-10774:
---

[~luoyuan] Now FS supports both root.XXX and xxx, but CS still does not support this.

> Federation: Normalize the yarn federation queue name
> 
>
> Key: YARN-10774
> URL: https://issues.apache.org/jira/browse/YARN-10774
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, yarn
>Reporter: Yuan LUO
>Priority: Major
> Attachments: YARN-10774.001.patch
>
>
> In YARN, root.abc is equivalent to the abc queue, so the routing
> behavior of both should be consistent in YARN federation.






[jira] [Reopened] (YARN-10701) The yarn.resource-types should support multi types without trimmed.

2021-05-19 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reopened YARN-10701:
---

> The yarn.resource-types should support multi types without trimmed.
> ---
>
> Key: YARN-10701
> URL: https://issues.apache.org/jira/browse/YARN-10701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10701.001.patch, YARN-10701.002.patch
>
>
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu, yarn.io/fpga</value>
> </property>
> {code}
> When I configured the resource types above, with GPU and FPGA, this error
> happened:
>  
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: ' yarn.io/fpga' is 
> not a valid resource name. A valid resource name must begin with a letter and 
> contain only letters, numbers, and any of: '.', '_', or '-'. A valid resource 
> name may also be optionally preceded by a name space followed by a slash. A 
> valid name space consists of period-separated groups of letters, numbers, and 
> dashes.{code}
>   
> The resource type values should be trimmed before validation.






[jira] [Commented] (YARN-10701) The yarn.resource-types should support multi types without trimmed.

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347445#comment-17347445
 ] 

Qi Zhu commented on YARN-10701:
---

Submitted the branch-3.3 backport patch to trigger Jenkins.

> The yarn.resource-types should support multi types without trimmed.
> ---
>
> Key: YARN-10701
> URL: https://issues.apache.org/jira/browse/YARN-10701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10701-branch-3.3.001.patch, YARN-10701.001.patch, 
> YARN-10701.002.patch
>
>
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu, yarn.io/fpga</value>
> </property>
> {code}
> When I configured the resource types above, with GPU and FPGA, this error
> happened:
>  
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: ' yarn.io/fpga' is 
> not a valid resource name. A valid resource name must begin with a letter and 
> contain only letters, numbers, and any of: '.', '_', or '-'. A valid resource 
> name may also be optionally preceded by a name space followed by a slash. A 
> valid name space consists of period-separated groups of letters, numbers, and 
> dashes.{code}
>   
> The resource type values should be trimmed before validation.






[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.

2021-05-07 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340599#comment-17340599
 ] 

Qi Zhu commented on YARN-10738:
---

Thanks a lot [~bibinchundatt] for the reply and the valuable information.

For the hot-spot case above, we could allocate based on the dominant resource
utilization: if vcores are full, vcores are the dominant resource, and we would
allocate on other nodes whose dominant resource utilization is not yet full.

Even when allocating by dominant resource utilization, I think we still need to
shuffle, but the shuffle gap could be kept consistent with the cluster size.
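The windowed shuffle being proposed — randomizing order only within fixed-size windows ("gaps") of the sorted node list, so near-equal nodes are spread across scheduler threads without discarding the overall sort — could be sketched as follows (a minimal illustration, not the actual patch; in practice the gap might scale with cluster size):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: shuffle a sorted node list only within windows of
// size `gap`, preserving the coarse sort order while breaking ties randomly
// so concurrent scheduler threads do not all pick the same head node.
public class WindowedShuffle {
    public static <T> List<T> shuffleWithGap(List<T> sortedNodes, int gap) {
        List<T> result = new ArrayList<>(sortedNodes);
        for (int start = 0; start < result.size(); start += gap) {
            int end = Math.min(start + gap, result.size());
            // subList is a view, so shuffling it shuffles `result` in place
            Collections.shuffle(result.subList(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> nodes = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println(shuffleWithGap(nodes, 4));
    }
}
```

With gap = 4, the first four (best) nodes stay ahead of the last four, but their relative order is randomized on each call.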

 

> When multi thread scheduling with multi node, we should shuffle with a gap to 
> prevent hot accessing nodes.
> --
>
> Key: YARN-10738
> URL: https://issues.apache.org/jira/browse/YARN-10738
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Multi-threaded scheduling with multi-node lookup is currently not reasonable.
> In large clusters it causes hot-spot nodes, which can overload individual nodes.
> Solution:
> I think we should shuffle the sorted node list (e.g. under the available-resource
> sort policy) within an interval.
> This will solve the above problem and avoid hot-spot nodes.






[jira] [Updated] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10761:
--
Attachment: YARN-10761.002.patch

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> image-2021-05-06-16-38-51-406.png, image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Commented] (YARN-10755) Multithreaded loading Apps from zk statestore

2021-05-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340570#comment-17340570
 ] 

Qi Zhu commented on YARN-10755:
---

Thanks [~chaosju] for the report, and [~BilwaST] for taking this.

I will help review it.
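The idea in the report — list the app IDs once, then read each app's state in parallel on a fixed thread pool — can be sketched like this (loadApp() is a placeholder for the per-app ZooKeeper read; none of these names are the real RMStateStore API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: divide per-app state reads across a thread pool.
public class ParallelAppLoader {
    // Placeholder standing in for a ZooKeeper read of one app's state.
    static String loadApp(String appId) {
        return "state-of-" + appId;
    }

    public static List<String> loadAll(List<String> appIds, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String appId : appIds) {
                futures.add(pool.submit(() -> loadApp(appId)));
            }
            List<String> states = new ArrayList<>();
            for (Future<String> f : futures) {
                states.add(f.get()); // preserves submission order
            }
            return states;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadAll(List.of("app_1", "app_2", "app_3"), 2));
        // [state-of-app_1, state-of-app_2, state-of-app_3]
    }
}
```

Collecting results through futures keeps the recovered app order deterministic even though the reads run concurrently.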

> Multithreaded loading Apps from zk statestore
> -
>
> Key: YARN-10755
> URL: https://issues.apache.org/jira/browse/YARN-10755
> Project: Hadoop YARN
>  Issue Type: Improvement
> Environment: version: hadooop-2.8.5
>Reporter: chaosju
>Assignee: Bilwa S T
>Priority: Major
> Attachments: image-2021-04-27-12-55-18-710.png
>
>
> In the RM, we can get the list of applications to read from the state store and
> then divide the work of reading the data associated with each app across
> multiple threads.
> I think this is important for large clusters.
> h2. Profile
> Profile by  TestZKRMStateStorePerf 
> Params: -appSize 2 -appattemptsize 2 -hostPort localhost:2181 
> Profile Result: loadRMAppState stage cost is 5s.
> Profile logs:
> !image-2021-04-27-12-55-18-710.png!  
>  
>  






[jira] [Updated] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10761:
--
Attachment: YARN-10761.003.patch

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340582#comment-17340582
 ] 

Qi Zhu commented on YARN-10761:
---

Fixed the checkstyle issue in the latest patch.

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340529#comment-17340529
 ] 

Qi Zhu commented on YARN-10761:
---

Thanks a lot [~ebadger] for the review.

I have merged the two create calls into one and saved the result in a local variable.

Updated it in the latest patch.

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> image-2021-05-06-16-38-51-406.png, image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics

2021-05-07 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340883#comment-17340883
 ] 

Qi Zhu commented on YARN-10763:
---

Thanks [~chaosju] for the report.

If you use aggregateContainersAllocated from the root queue metrics to compute
the delta, I think you can derive the throughput of the cluster.
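The delta approach can be sketched as a small sampler: given a monotonically increasing counter (such as the root queue's aggregateContainersAllocated), divide the counter delta by the elapsed time at each sample. This is an illustrative sketch under that assumption, not YARN metrics code:

```java
// Hypothetical sketch: derive containers-allocated-per-second from the
// delta of a monotonically increasing counter sampled at intervals.
public class ThroughputSampler {
    private long lastCount;
    private long lastTimeMs;

    public ThroughputSampler(long initialCount, long nowMs) {
        this.lastCount = initialCount;
        this.lastTimeMs = nowMs;
    }

    /** Returns containers/second since the previous sample. */
    public double sample(long currentCount, long nowMs) {
        long deltaCount = currentCount - lastCount;
        long deltaMs = nowMs - lastTimeMs;
        lastCount = currentCount;
        lastTimeMs = nowMs;
        // Guard against zero or negative elapsed time (e.g. clock skew).
        return deltaMs <= 0 ? 0.0 : deltaCount * 1000.0 / deltaMs;
    }

    public static void main(String[] args) {
        ThroughputSampler s = new ThroughputSampler(100, 0);
        System.out.println(s.sample(160, 2000)); // 30.0 containers/second
    }
}
```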

 

> add  the speed of containers assigned metrics to ClusterMetrics
> ---
>
> Key: YARN-10763
> URL: https://issues.apache.org/jira/browse/YARN-10763
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1
>Reporter: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch, screenshot-1.png
>
>
> It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for 
> measuring cluster throughput.






[jira] [Comment Edited] (YARN-9927) RM multi-thread event processing mechanism

2021-05-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339383#comment-17339383
 ] 

Qi Zhu edited comment on YARN-9927 at 5/6/21, 6:41 AM:
---

Great review and investigation!

Thanks very much [~ebadger] [~gandras].

I agree that we should do some stress testing via SLS or manually, and that a
more generic way of event handling would be a great improvement in YARN.

I will investigate how to use SLS to confirm the improvement.

As for the test, I will change it to cover both the multi-threaded and the
single-threaded cases.
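One common shape for multi-threaded event processing — and a possible baseline for the stress tests — is to hash events by a key (e.g. node ID) onto N single-threaded executors, so events for the same key stay ordered while different keys are handled in parallel. This is an illustration of the idea only, not the YARN AsyncDispatcher API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: shard event handling across single-threaded
// executors keyed by event source, preserving per-key ordering.
public class ShardedDispatcher {
    private final ExecutorService[] shards;

    public ShardedDispatcher(int n) {
        shards = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void dispatch(String key, Runnable handler) {
        // Same key -> same shard -> same thread -> ordered handling.
        int shard = Math.floorMod(key.hashCode(), shards.length);
        shards[shard].execute(handler);
    }

    public void shutdown() throws InterruptedException {
        for (ExecutorService s : shards) {
            s.shutdown();
            s.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        ShardedDispatcher d = new ShardedDispatcher(4);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            d.dispatch("node-" + (i % 10), handled::incrementAndGet);
        }
        d.shutdown();
        System.out.println(handled.get()); // 100
    }
}
```

Per-key ordering matters here because RMNodeStatusEvents for one node must not be reordered even when many nodes report concurrently.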

 


was (Author: zhuqi):
Great review and investigation!

Thanks very much [~ebadger] [~ebadger].

I agree that we should do some stress testing via SLS or manually, and that a
more generic way of event handling would be a great improvement in YARN.

I will investigate how to use SLS to confirm the improvement.

As for the test, I will change it to cover both the multi-threaded and the
single-threaded cases.

 

> RM multi-thread event processing mechanism
> --
>
> Key: YARN-9927
> URL: https://issues.apache.org/jira/browse/YARN-9927
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0, 2.9.2
>Reporter: hcarrot
>Assignee: Qi Zhu
>Priority: Major
> Attachments: RM multi-thread event processing mechanism.pdf, 
> YARN-9927.001.patch, YARN-9927.002.patch, YARN-9927.003.patch, 
> YARN-9927.004.patch, YARN-9927.005.patch
>
>
> Recently, we have observed serious event blocking in the RM event dispatcher
> queue. After analyzing RM event monitoring data and RM event processing
> logic, we found that:
> 1) environment: a cluster with thousands of nodes
> 2) RMNodeStatusEvent dominates 90% of the time consumed by the RM event scheduler
> 3) meanwhile, RM event processing runs in single-threaded mode, which results
> in low headroom for the RM event scheduler and thus limits RM performance.
> So we propose an RM multi-thread event processing mechanism to improve RM
> performance.






[jira] [Updated] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10761:
--
Attachment: image-2021-05-06-16-39-28-362.png

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340081#comment-17340081
 ] 

Qi Zhu commented on YARN-10761:
---

[~ebadger] [~pbacsko] [~gandras] [~bilwa_st]

Could you help review this?

And I have confirmed it in a test case.

Thanks.

!image-2021-05-06-16-38-51-406.png|width=736,height=84!

!image-2021-05-06-16-39-28-362.png|width=698,height=93!

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Updated] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10761:
--
Attachment: image-2021-05-06-16-38-51-406.png

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, image-2021-05-06-16-38-51-406.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We had better also add the four busy event types to the metrics, according to
> YARN-9927.






[jira] [Created] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Qi Zhu (Jira)
Qi Zhu created YARN-10761:
-

 Summary: Add more event type to RM Dispatcher event metrics.
 Key: YARN-10761
 URL: https://issues.apache.org/jira/browse/YARN-10761
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Qi Zhu
Assignee: Qi Zhu


YARN-9615 added NodesListManagerEventType to the event metrics.

We had better also add the four busy event types to the metrics, according to
YARN-9927.






[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-17 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346181#comment-17346181
 ] 

Qi Zhu commented on YARN-10771:
---

Thanks [~chaosju] for this.

This is useful for users to understand the event load; we can add it to the
cluster metrics. I will help review this.
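Exposing the queue size can be as simple as a gauge that polls the dispatcher's BlockingQueue. A minimal sketch, with illustrative names only (the actual patch would wire this into Hadoop's metrics system):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: expose a dispatcher's event queue depth as a gauge
// by polling BlockingQueue.size() on each metrics snapshot.
public class QueueSizeGauge {
    private final BlockingQueue<?> eventQueue;

    public QueueSizeGauge(BlockingQueue<?> eventQueue) {
        this.eventQueue = eventQueue;
    }

    /** Sampled by the metrics system; size() is O(1) for LinkedBlockingQueue. */
    public int value() {
        return eventQueue.size();
    }

    public static void main(String[] args) {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        q.add("event-1");
        q.add("event-2");
        System.out.println(new QueueSizeGauge(q).value()); // 2
    }
}
```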

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
>
> Add a cluster metric for the size of the scheduler event queue and the RM event
> queue. This lets us know the load of the RM and makes the metrics convenient to
> monitor.
>  
>  






[jira] [Commented] (YARN-10555) Missing access check before getAppAttempts

2021-05-17 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346018#comment-17346018
 ] 

Qi Zhu commented on YARN-10555:
---

Thanks [~aajisaka] for the backport.

>  Missing access check before getAppAttempts
> ---
>
> Key: YARN-10555
> URL: https://issues.apache.org/jira/browse/YARN-10555
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
>  Labels: pull-request-available, security
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10555_1.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> It seems that we miss a security check before getAppAttempts, see 
> [https://github.com/apache/hadoop/blob/513f1995adc9b73f9c7f4c7beb89725b51b313ac/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L1127]
> thus we can get some sensitive information, like the logs link.
> {code:java}
> application_1609318368700_0002 belong to user2
> user1@hadoop11$ curl --negotiate -u  : 
> http://hadoop11:8088/ws/v1/cluster/apps/application_1609318368700_0002/appattempts/|jq
> {
>   "appAttempts": {
> "appAttempt": [
>   {
> "id": 1,
> "startTime": 1609318411566,
> "containerId": "container_1609318368700_0002_01_01",
> "nodeHttpAddress": "hadoop12:8044",
> "nodeId": "hadoop12:36831",
> "logsLink": 
> "http://hadoop12:8044/node/containerlogs/container_1609318368700_0002_01_01/user2;,
> "blacklistedNodes": "",
> "nodesBlacklistedBySystem": ""
>   }
> ]
>   }
> }
> {code}
> Other APIs, like getApps and getApp, have an access check like "hasAccess(app,
> hsr)"; they hide the logs link if the app ID does not belong to the querying
> user, see
> [https://github.com/apache/hadoop/blob/513f1995adc9b73f9c7f4c7beb89725b51b313ac/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java#L1098]
> We need to add hasAccess(app, hsr) to getAppAttempts.
>  
> Besides, at 
> [https://github.com/apache/hadoop/blob/580a6a75a3e3d3b7918edeffd6e93fc211166884/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java#L145]
> it seems that we have an access check in its caller, so for now I pass "true" to
> AppAttemptInfo in the patch.
>  






[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-17 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346575#comment-17346575
 ] 

Qi Zhu commented on YARN-10771:
---

Thanks [~chaosju] for the patch.

I think we should change "rm event queue size" to "rm dispatcher queue size"
and "scheduler event queue size" to "scheduler dispatcher queue size".

Then submit the patch to trigger Jenkins.

 

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch
>
>
> Add a cluster metric for the size of the scheduler event queue and the RM event
> queue. This lets us know the load of the RM and makes the metrics convenient to
> monitor.
>  
>  






[jira] [Comment Edited] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-17 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346575#comment-17346575
 ] 

Qi Zhu edited comment on YARN-10771 at 5/18/21, 3:59 AM:
-

Thanks [~chaosju] for the patch.

I think we should change "rm event queue size" to "rm dispatcher event queue
size" and "scheduler event queue size" to "scheduler dispatcher event queue
size".

Then submit the patch to trigger Jenkins.

 


was (Author: zhuqi):
Thanks [~chaosju] for the patch.

I think we should change "rm event queue size" to "rm dispatcher queue size"
and "scheduler event queue size" to "scheduler dispatcher queue size".

Then submit the patch to trigger Jenkins.

 

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch
>
>
> Add a cluster metric for the size of the scheduler event queue and the RM event
> queue. This lets us know the load of the RM and makes the metrics convenient to
> monitor.
>  
>  






[jira] [Commented] (YARN-10632) Make maximum depth allowed configurable.

2021-05-13 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343750#comment-17343750
 ] 

Qi Zhu commented on YARN-10632:
---

Thanks [~gandras] for the reminder.

I will change it based on YARN-10571.

> Make maximum depth allowed configurable.
> 
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch
>
>
> Now the maximum allowed depth is fixed at 2, but I think this should be
> configurable.






[jira] [Commented] (YARN-10632) Make maximum depth allowed configurable.

2021-05-13 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343782#comment-17343782
 ] 

Qi Zhu commented on YARN-10632:
---

[~gandras] 

I have updated it in the latest patch.

> Make maximum depth allowed configurable.
> 
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch
>
>
> Now the maximum allowed depth is fixed at 2, but I think this should be
> configurable.






[jira] [Updated] (YARN-10632) Make maximum depth allowed configurable.

2021-05-13 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10632:
--
Attachment: YARN-10632.003.patch

> Make maximum depth allowed configurable.
> 
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch
>
>
> Now the maximum allowed depth is fixed at 2, but I think this should be
> configurable.






[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349052#comment-17349052
 ] 

Qi Zhu commented on YARN-10779:
---

Thanks [~gandras] for the reminder.

Should we let users reinitialize this to false, and in which cases would CS
users need to change it to false?

Also, would simply changing the field from static to non-static solve this?

Thanks.
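The option under discussion — a flag (normally read once from yarn-site.xml) deciding whether application tags are lowercased — could look roughly like this. The class and field names are hypothetical, not the actual patch, and the real code uses Hadoop's StringUtils rather than plain toLowerCase():

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: make the lowercase conversion of application tags
// conditional on a configuration flag instead of always-on.
public class TagNormalizer {
    private final boolean lowercaseTags; // would come from yarn-site.xml

    public TagNormalizer(boolean lowercaseTags) {
        this.lowercaseTags = lowercaseTags;
    }

    public Set<String> normalize(Set<String> tags) {
        Set<String> result = new TreeSet<>();
        for (String tag : tags) {
            result.add(lowercaseTags ? tag.toLowerCase() : tag);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> tags = Set.of("userid=Alice");
        System.out.println(new TagNormalizer(true).normalize(tags));  // [userid=alice]
        System.out.println(new TagNormalizer(false).normalize(tags)); // [userid=Alice]
    }
}
```

Keeping the flag per-instance (non-static) is what makes it reinitializable in tests, which is the concern raised above.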

> Add option to disable lowercase conversion in GetApplicationsRequestPBImpl 
> and ApplicationSubmissionContextPBImpl
> -
>
> Key: YARN-10779
> URL: https://issues.apache.org/jira/browse/YARN-10779
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10779-001.patch, YARN-10779-002.patch, 
> YARN-10779-003.patch, YARN-10779-POC.patch
>
>
> In both {{GetApplicationsRequestPBImpl}} and 
> {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase 
> conversion:
> {noformat}
> checkTags(tags);
> // Convert applicationTags to lower case and add
> this.applicationTags = new TreeSet<>();
> for (String tag : tags) {
>   this.applicationTags.add(StringUtils.toLowerCase(tag));
> }
>   }
> {noformat}
> However, we encountered some cases where this is not desirable for "userid" 
> tags. 
> Proposed solution: since both classes are pretty low-level and can be often 
> instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should 
> be cached inside them. A new property should be created which tells whether 
> lowercase conversion should occur or not.






[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl

2021-05-20 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348941#comment-17348941
 ] 

Qi Zhu commented on YARN-10779:
---

Thanks [~pbacsko] for this work.

The patch LGTM; just fix the one remaining checkstyle issue.

> Add option to disable lowercase conversion in GetApplicationsRequestPBImpl 
> and ApplicationSubmissionContextPBImpl
> -
>
> Key: YARN-10779
> URL: https://issues.apache.org/jira/browse/YARN-10779
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10779-001.patch, YARN-10779-002.patch, 
> YARN-10779-003.patch, YARN-10779-POC.patch
>
>
> In both {{GetApplicationsRequestPBImpl}} and 
> {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase 
> conversion:
> {noformat}
> checkTags(tags);
> // Convert applicationTags to lower case and add
> this.applicationTags = new TreeSet<>();
> for (String tag : tags) {
>   this.applicationTags.add(StringUtils.toLowerCase(tag));
> }
>   }
> {noformat}
> However, we encountered some cases where this is not desirable for "userid" 
> tags. 
> Proposed solution: since both classes are pretty low-level and can often be 
> instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should 
> be cached inside them. A new property should be created that controls whether 
> the lowercase conversion occurs.






[jira] [Commented] (YARN-10657) We should make max application per queue to support node label.

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349094#comment-17349094
 ] 

Qi Zhu commented on YARN-10657:
---

Thanks [~gandras] for the reply.

We can close this for now, until we discuss a better solution for 
node-label-based max applications.

Thanks.

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the comment above:
> We should look deeper into the label-related max applications per queue.
> I think when node labels are enabled in a queue, max applications should 
> consider the max capacity of all labels.
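One possible shape of label-aware max applications is sketched below. The formula `systemMaxApps * labelCapacity` is an assumption for illustration, not the CapacityScheduler's actual computation, and the class and method names are hypothetical.

```java
import java.util.Map;

// Sketch: derive a queue's max applications per node label from that
// label's capacity fraction, instead of one global number. The formula
// is illustrative only.
public class LabelAwareMaxApps {
    public static int maxAppsForLabel(int systemMaxApps,
                                      Map<String, Float> labelCapacity,
                                      String label) {
        // capacity is a fraction in [0, 1]; unknown labels get 0.
        float capacity = labelCapacity.getOrDefault(label, 0.0f);
        return (int) (systemMaxApps * capacity);
    }

    public static void main(String[] args) {
        // "" is the default partition; "gpu" a hypothetical exclusive label.
        Map<String, Float> caps = Map.of("", 0.5f, "gpu", 0.25f);
        System.out.println(maxAppsForLabel(10000, caps, "gpu")); // 2500
        System.out.println(maxAppsForLabel(10000, caps, ""));    // 5000
    }
}
```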






[jira] [Commented] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349111#comment-17349111
 ] 

Qi Zhu commented on YARN-10324:
---

[~yaoguangdong] You should submit it and set it to Patch Available; then 
Jenkins will be triggered.

With the button:

!image-2021-05-21-17-48-03-476.png|width=88,height=45!

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch, 
> YARN-10324.003.patch, image-2021-05-21-17-48-03-476.png
>
>
> As the cluster grows bigger and bigger, the time a Reduce spends fetching 
> Map results from the NodeManager grows longer and longer. We often see WARN 
> logs like the following in the reducers' logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
> We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 
> million IO stream operations in the shuffle phase. If each reduce fetches 
> only a small amount of data from the map output files, disk IO utilization 
> becomes very high in a big cluster, read timeouts happen frequently, and 
> application completion times grow longer.
> We found that ShuffleHandler already has an IndexCache for caching the 
> file.out.index file. We wanted to turn the many small IOs into fewer big IOs 
> to reduce the number of small disk IO operations, so we tried caching all the 
> small file data (file.out) in memory when the first fetch request arrives. 
> The subsequent fetch requests then only need to read from memory, avoiding 
> disk IO. After caching the data in memory, we found the read timeouts 
> disappeared.
>  
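The caching idea described in the issue can be sketched roughly as follows. This is a simplified illustration: the size threshold, the plain concurrent map (no eviction), and the class name are assumptions; the real change lives in ShuffleHandler next to the existing IndexCache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: serve repeated fetches of small map-output files from memory
// instead of re-reading the disk each time. Eviction is omitted for brevity.
public class SmallFileCache {
    private final long maxFileBytes;
    private final Map<Path, byte[]> cache = new ConcurrentHashMap<>();

    public SmallFileCache(long maxFileBytes) {
        this.maxFileBytes = maxFileBytes;
    }

    /** Returns file bytes, caching small files after the first read. */
    public byte[] read(Path file) throws IOException {
        byte[] cached = cache.get(file);
        if (cached != null) {
            return cached;                  // later fetches avoid disk IO entirely
        }
        byte[] data = Files.readAllBytes(file);
        if (data.length <= maxFileBytes) {
            cache.putIfAbsent(file, data);  // only cache "small" outputs
        }
        return data;
    }
}
```

With 1,000 reducers fetching the same small file.out, only the first fetch pays the disk IO; the other ~999 reads come from memory, which matches the observation that the read timeouts disappeared.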






[jira] [Commented] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349124#comment-17349124
 ] 

Qi Zhu commented on YARN-10324:
---

[~yaoguangdong] I'm not sure whether you removed the original 003 and 
resubmitted it?

Waiting for Jenkins now; if it is not triggered within a few hours, you should 
attach the patch again to trigger it.

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch, 
> YARN-10324.003.patch, image-2021-05-21-17-48-03-476.png
>
>
> As the cluster grows bigger and bigger, the time a Reduce spends fetching 
> Map results from the NodeManager grows longer and longer. We often see WARN 
> logs like the following in the reducers' logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
> We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 
> million IO stream operations in the shuffle phase. If each reduce fetches 
> only a small amount of data from the map output files, disk IO utilization 
> becomes very high in a big cluster, read timeouts happen frequently, and 
> application completion times grow longer.
> We found that ShuffleHandler already has an IndexCache for caching the 
> file.out.index file. We wanted to turn the many small IOs into fewer big IOs 
> to reduce the number of small disk IO operations, so we tried caching all the 
> small file data (file.out) in memory when the first fetch request arrives. 
> The subsequent fetch requests then only need to read from memory, avoiding 
> disk IO. After caching the data in memory, we found the read timeouts 
> disappeared.
>  






[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349065#comment-17349065
 ] 

Qi Zhu commented on YARN-10781:
---

Thanks [~zhangxiping] for this.

If you mean the case where Spark dynamic resource allocation is enabled:

I think Spark will remove the idle executors, and after the idle executors are 
removed, shouldn't the aggregation thread exit?

How does Spark handle this? Can you add the related Spark code?

Thanks.

 

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> We observed more than 100 applications running on one NM. Most of these 
> applications are Spark Streaming tasks, but they have no running containers. 
> When an offline (batch) application running on the node finishes, its log 
> cannot be uploaded to HDFS.






[jira] [Commented] (YARN-10779) Add option to disable lowercase conversion in GetApplicationsRequestPBImpl and ApplicationSubmissionContextPBImpl

2021-05-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349073#comment-17349073
 ] 

Qi Zhu commented on YARN-10779:
---

Thanks [~pbacsko] for the reply.

I also agree that it only affects the ResourceManager when the RM is 
restarted, not the CapacityScheduler-related parts.

[~gandras] And if we need to re-initialize the RM-related property in the 
future, just like reconfiguring the NameNode in HDFS, we can make it 
non-static; but for now I think static is fine.

Thanks.

> Add option to disable lowercase conversion in GetApplicationsRequestPBImpl 
> and ApplicationSubmissionContextPBImpl
> -
>
> Key: YARN-10779
> URL: https://issues.apache.org/jira/browse/YARN-10779
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10779-001.patch, YARN-10779-002.patch, 
> YARN-10779-003.patch, YARN-10779-POC.patch
>
>
> In both {{GetApplicationsRequestPBImpl}} and 
> {{ApplicationSubmissionContextPBImpl}}, there is a forced lowercase 
> conversion:
> {noformat}
> checkTags(tags);
> // Convert applicationTags to lower case and add
> this.applicationTags = new TreeSet<>();
> for (String tag : tags) {
>   this.applicationTags.add(StringUtils.toLowerCase(tag));
> }
>   }
> {noformat}
> However, we encountered some cases where this is not desirable for "userid" 
> tags. 
> Proposed solution: since both classes are pretty low-level and can often be 
> instantiated, a {{Configuration}} object which loads {{yarn-site.xml}} should 
> be cached inside them. A new property should be created that controls whether 
> the lowercase conversion occurs.






[jira] [Updated] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-21 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10324:
--
Attachment: image-2021-05-21-17-48-03-476.png

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch, 
> YARN-10324.003.patch, image-2021-05-21-17-48-03-476.png
>
>
> As the cluster grows bigger and bigger, the time a Reduce spends fetching 
> Map results from the NodeManager grows longer and longer. We often see WARN 
> logs like the following in the reducers' logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
> We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 
> million IO stream operations in the shuffle phase. If each reduce fetches 
> only a small amount of data from the map output files, disk IO utilization 
> becomes very high in a big cluster, read timeouts happen frequently, and 
> application completion times grow longer.
> We found that ShuffleHandler already has an IndexCache for caching the 
> file.out.index file. We wanted to turn the many small IOs into fewer big IOs 
> to reduce the number of small disk IO operations, so we tried caching all the 
> small file data (file.out) in memory when the first fetch request arrives. 
> The subsequent fetch requests then only need to read from memory, avoiding 
> disk IO. After caching the data in memory, we found the read timeouts 
> disappeared.
>  






[jira] [Commented] (YARN-10543) Timeline Server V1.5 not supporting audit log

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348066#comment-17348066
 ] 

Qi Zhu commented on YARN-10543:
---

Thanks [~gb.ana...@gmail.com] for the patch.

The patch generally LGTM.

But we'd better add a simple unit test that intercepts the audit log.
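The suggested unit test could follow this pattern: attach an in-memory handler to the audit logger and assert on what was emitted. Shown with java.util.logging for self-containment; the actual test would target the TimelineServer's log4j-based audit logger the same way, via a capturing appender. The logger name and message format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch of a test that intercepts audit log output in memory.
public class AuditLogCaptureTest {
    static final Logger AUDIT = Logger.getLogger("timeline.audit"); // hypothetical name

    /** Runs op with a capturing handler attached, returning the messages logged. */
    static List<String> captureAudit(Runnable op) {
        List<String> messages = new ArrayList<>();
        Handler handler = new Handler() {
            @Override public void publish(LogRecord r) { messages.add(r.getMessage()); }
            @Override public void flush() {}
            @Override public void close() {}
        };
        AUDIT.addHandler(handler);
        try {
            op.run();
        } finally {
            AUDIT.removeHandler(handler);   // always detach, even on failure
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> logs = captureAudit(() ->
                AUDIT.log(Level.INFO, "USER=alice OPERATION=getEntities RESULT=SUCCESS"));
        System.out.println(logs.size());  // 1
    }
}
```

The test would then call a Timeline REST handler inside `captureAudit` and assert the expected USER/OPERATION fields appear.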

> Timeline Server V1.5 not supporting audit log
> -
>
> Key: YARN-10543
> URL: https://issues.apache.org/jira/browse/YARN-10543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.1.1
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
>  Labels: TimeLine
> Attachments: YARN-10543-001.patch, YARN-10543-002.patch
>
>
> Like the JHS, TS V1.5 can also emit an audit log when the Timeline REST APIs 
> are accessed. This helps us know the operations performed on the TS.






[jira] [Comment Edited] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347313#comment-17347313
 ] 

Qi Zhu edited comment on YARN-10771 at 5/20/21, 2:29 AM:
-

Thanks [~chaosju] for the update.

The patch LGTM now.

Waiting for [~pbacsko] [~ebadger] to double-check.

Thanks.


was (Author: zhuqi):
Thanks [~chaosju] for the update.

The patch LGTM now.

Waiting for [~pbacsko] to double-check.

Thanks.

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch, YARN-10771.002.patch, 
> YARN-10771.003.patch, YARN-10771.004.patch
>
>
> Add cluster metrics for the size of the scheduler event queue and the RM 
> event queue. This lets us know the load of the RM and makes monitoring these 
> metrics convenient.
>  
>  
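The metric itself can be sketched as a gauge that samples the queue sizes on read. This is illustrative only; the real patch would register these values through Hadoop's metrics2 system, and the class name is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Sketch: expose the dispatchers' event-queue sizes as gauges sampled on
// read, so a monitoring system can track RM load over time.
public class EventQueueMetrics {
    private final Supplier<Integer> schedulerQueueSize;
    private final Supplier<Integer> rmQueueSize;

    public EventQueueMetrics(Supplier<Integer> schedulerQueueSize,
                             Supplier<Integer> rmQueueSize) {
        this.schedulerQueueSize = schedulerQueueSize;
        this.rmQueueSize = rmQueueSize;
    }

    public int getSchedulerEventQueueSize() { return schedulerQueueSize.get(); }
    public int getRmEventQueueSize() { return rmQueueSize.get(); }

    public static void main(String[] args) {
        BlockingQueue<Object> schedulerQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Object> rmQueue = new LinkedBlockingQueue<>();
        EventQueueMetrics metrics =
                new EventQueueMetrics(schedulerQueue::size, rmQueue::size);
        schedulerQueue.add(new Object());   // pretend an event was dispatched
        System.out.println(metrics.getSchedulerEventQueueSize()); // 1
        System.out.println(metrics.getRmEventQueueSize());        // 0
    }
}
```

Sampling via `Supplier` keeps the metric decoupled from the dispatcher internals: the gauge reads the live queue size only when scraped.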






[jira] [Commented] (YARN-10771) Add cluster metric for size of SchedulerEventQueue and RMEventQueue

2021-05-18 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347313#comment-17347313
 ] 

Qi Zhu commented on YARN-10771:
---

Thanks [~chaosju] for the update.

The patch LGTM now.

Waiting for [~pbacsko] to double-check.

Thanks.

> Add cluster metric for size of SchedulerEventQueue and RMEventQueue
> ---
>
> Key: YARN-10771
> URL: https://issues.apache.org/jira/browse/YARN-10771
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: chaosju
>Assignee: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch, YARN-10771.002.patch, 
> YARN-10771.003.patch, YARN-10771.004.patch
>
>
> Add cluster metrics for the size of the scheduler event queue and the RM 
> event queue. This lets us know the load of the RM and makes monitoring these 
> metrics convenient.
>  
>  






[jira] [Resolved] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.

2021-05-14 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu resolved YARN-10545.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

> Improve the readability of diagnostics log in yarn-ui2 web page.
> 
>
> Key: YARN-10545
> URL: https://issues.apache.org/jira/browse/YARN-10545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: Diagnostics shows unreadble.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If the diagnostics log in yarn-ui2 has multiple lines, the line breaks and 
> spaces are not displayed, which makes it hard to read.






[jira] [Assigned] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.

2021-05-14 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned YARN-10545:
-

Assignee: akiyamaneko

> Improve the readability of diagnostics log in yarn-ui2 web page.
> 
>
> Key: YARN-10545
> URL: https://issues.apache.org/jira/browse/YARN-10545
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Diagnostics shows unreadble.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If the diagnostics log in yarn-ui2 has multiple lines, the line breaks and 
> spaces are not displayed, which makes it hard to read.






[jira] [Comment Edited] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344478#comment-17344478
 ] 

Qi Zhu edited comment on YARN-10324 at 5/14/21, 9:01 AM:
-

Hi [~yaoguangdong] 

Thanks for this work. I have added you to the contributor list and assigned 
this to you.

You can submit the latest patch to trigger Jenkins.

 


was (Author: zhuqi):
Hi [~yaoguangdong] 

Thanks for this work. I have added you to the contributor list.

You can submit the latest patch to trigger Jenkins.

 

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch
>
>
> As the cluster grows bigger and bigger, the time a Reduce spends fetching 
> Map results from the NodeManager grows longer and longer. We often see WARN 
> logs like the following in the reducers' logs.
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
> We checked the NodeManager server and found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. Our 
> analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 
> million IO stream operations in the shuffle phase. If each reduce fetches 
> only a small amount of data from the map output files, disk IO utilization 
> becomes very high in a big cluster, read timeouts happen frequently, and 
> application completion times grow longer.
> We found that ShuffleHandler already has an IndexCache for caching the 
> file.out.index file. We wanted to turn the many small IOs into fewer big IOs 
> to reduce the number of small disk IO operations, so we tried caching all the 
> small file data (file.out) in memory when the first fetch request arrives. 
> The subsequent fetch requests then only need to read from memory, avoiding 
> disk IO. After caching the data in memory, we found the read timeouts 
> disappeared.
>  






[jira] [Comment Edited] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-14 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344483#comment-17344483
 ] 

Qi Zhu edited comment on YARN-10761 at 5/14/21, 9:16 AM:
-

Thanks [~snemeth] for the reminder.

Sorry for the commit.
YARN-9615 was contributed by me, so I committed this related small change.

Next time I will wait for other committers to check and commit (with more than 
two +1s).

 


was (Author: zhuqi):
Thanks [~snemeth] for the reminder.

Sorry for the commit.
YARN-9615 was contributed by me, so I committed this related small change.

Next time I will wait for other committers to check and commit.

 

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10761.001.patch, YARN-10761.002.patch, 
> YARN-10761.003.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> Since YARN-9615 added NodesListManagerEventType to the event metrics, we'd 
> better add all 4 busy event types to the metrics, per YARN-9927.






  1   2   3   4   5   6   7   8   9   10   >