[jira] [Commented] (YARN-11021) Define Hadoop YARN term "vcore"

2021-12-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463460#comment-17463460
 ] 

Eric Payne commented on YARN-11021:
---

https://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
That page defines the property {{yarn.nodemanager.resource.cpu-vcores}}. This 
property must be set in the configuration of each NodeManager. It 
advertises to the ResourceManager the number of virtual cores that can be used 
on the node. The right value varies depending on your use case, but 
it is usually set to (#CPUs x 10).

> Define Hadoop YARN term "vcore"
> ---
>
> Key: YARN-11021
> URL: https://issues.apache.org/jira/browse/YARN-11021
> Project: Hadoop YARN
> Issue Type: Wish
> Components: docs, documentation
> Affects Versions: 3.3.1
> Reporter: Aleksey Tsalolikhin
> Priority: Major
>
> Hello,
> This is a request to define the Hadoop YARN term "vCore".  It's clearly 
> different than vCPU as in the number of virtual CPUs (or CPU cores) a system 
> has as per /proc/cpuinfo. What is a YARN vcore, please?
> {*}Background{*}: I am running Hadoop YARN on 24 AWS EC2 instances from the 
> R5 family (memory-intensive) with the instance size of 24 XLarge (96 vCPUs 
> and 768 GB RAM each), plus the cluster master.
> I've launched a Spark application with the following spark-submit parameters:
> {{    --executor-memory 224G}}
> {{    --conf spark.executor.memoryOverhead=23901M}}
> {{    --executor-cores 32}}
> That sets a ratio of about 250 GB of RAM (combined) to 32 vCPUs per executor; 
> I have Spark dynamic resource allocation enabled, so I expect to see three 
> executors per instance, and that's how it turns out.
> 24 nodes x 3 executors per node = 72 executors
> Plus the Application Master running on the Master node makes 73 containers.
> This matches the "73 allocated" I see in "yarn top" output in the 
> "Containers" line:
> {{    YARN top - 11:03:57, up 0d, 18:9, 1 active users, queue(s): root}}
> {{    NodeManager(s): 24 total, 24 active, 0 unhealthy, 44 decommissioned, 0 
> lost, 0 rebooted}}
> {{    Queue(s) Applications: 1 running, 1 submitted, 0 pending, 0 completed, 
> 0 killed, 0 failed}}
> {{    Queue(s) Mem(GB): 183 available, 17809 allocated, 69008 pending, 247 
> reserved}}
> {{    Queue(s) VCores: 2230 available, 73 allocated, 279 pending, 1 reserved}}
> {{    Queue(s) Containers: 73 allocated, 279 pending, 1 reserved}}
> Most of the memory is allocated, which is as expected.
> But why does the "Queue(s) VCores" line say "73 allocated"?
> Looks like 1 VCore = 32 vCPUs?
> I looked in /etc/hadoop/conf/yarn-site.xml on one of the 24XL task
> instances with 96 vCPUs to double check how many virtual CPUs YARN thinks
> the node has, and it is 96 as expected:
> {{  <property>}}
> {{    <name>yarn.nodemanager.resource.cpu-vcores</name>}}
> {{    <value>96</value>}}
> {{  </property>}}
> I looked through all the Hadoop YARN documentation linked from 
> https://hadoop.apache.org/docs/stable/index.html looking for a definition of 
> a Hadoop YARN vCore and I couldn't find one.
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
>  uses "virtual cores" and "computation based resource" when talking about 
> vCores.
> What is a Hadoop YARN vCore?  How does it relate to virtual CPUs I see in 
> e.g., /proc/cpuinfo on Linux?
> There are many mentions of "vcore" in Hadoop YARN documentation; could we 
> please add a definition of this term?
> Thanks,
> Aleksey
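
The figures quoted in the report can be cross-checked with a little arithmetic. The sketch below is illustrative only: the class and method names are made up for this example, and it uses nothing beyond the numbers given above (24 nodes, 96 vcores per node, 3 executors of 32 cores per node, one Application Master, and the "Queue(s) VCores" line from "yarn top").

```java
final class VcoreArithmetic {
    // Figures taken from the report above.
    static final int NODES = 24;
    static final int VCORES_PER_NODE = 96;     // yarn.nodemanager.resource.cpu-vcores
    static final int EXECUTORS_PER_NODE = 3;
    static final int CORES_PER_EXECUTOR = 32;  // --executor-cores

    /** Total vcores advertised by all NodeManagers. */
    static int clusterVcores() {
        return NODES * VCORES_PER_NODE;
    }

    /** Executor containers plus one Application Master container. */
    static int containers() {
        return NODES * EXECUTORS_PER_NODE + 1;
    }

    /** Cores requested by all executors through --executor-cores. */
    static int requestedExecutorCores() {
        return NODES * EXECUTORS_PER_NODE * CORES_PER_EXECUTOR;
    }

    /** Available + allocated + reserved, as printed by "yarn top". */
    static int reportedQueueVcores() {
        return 2230 + 73 + 1;
    }

    public static void main(String[] args) {
        System.out.println("vcores advertised by the cluster: " + clusterVcores());
        System.out.println("vcores reported by yarn top:      " + reportedQueueVcores());
        System.out.println("containers (executors + AM):      " + containers());
        System.out.println("cores requested by executors:     " + requestedExecutorCores());
    }
}
```

The total reported by "yarn top" equals 24 x 96, while the "73 allocated" figure equals the container count rather than the cores requested through --executor-cores. That suggests each container is being counted as one vcore in this output, which is the behaviour the report asks to have explained.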



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-21 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10178:
-

Assignee: Andras Gyori  (was: Qi Zhu)

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.2.1
> Reporter: tuyu
> Assignee: Andras Gyori
> Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:899)
>     at java.util.TimSort.mergeAt(TimSort.java:516)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
>     at java.util.TimSort.sort(TimSort.java:254)
>     at java.util.Arrays.sort(Arrays.java:1512)
>     at java.util.ArrayList.sort(ArrayList.java:1462)
>     at java.util.Collections.sort(Collections.java:177)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default for object arrays, 
> and TimSort places a few requirements on the comparison method:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z  -->  x > z
> 3. x.compareTo(y) == 0  -->  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the comparison method does not satisfy these requirements, TimSort may throw 
> 'java.lang.IllegalArgumentException'.
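
To make the three requirements listed above concrete, here is a minimal sketch of a brute-force check over a small sample. `ComparatorContractCheck` and `violatesContract` are hypothetical names invented for this example; they are not part of the JDK or of Hadoop, and the check only illustrates the rules rather than reproducing what TimSort does internally.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ComparatorContractCheck {
    /** Returns true if cmp breaks any of the three rules on the given sample. */
    static <T> boolean violatesContract(Comparator<T> cmp, List<T> sample) {
        for (T x : sample) {
            for (T y : sample) {
                int xy = Integer.signum(cmp.compare(x, y));
                int yx = Integer.signum(cmp.compare(y, x));
                // Rule 1: swapping the arguments must flip the sign.
                if (xy != -yx) {
                    return true;
                }
                for (T z : sample) {
                    int yz = Integer.signum(cmp.compare(y, z));
                    int xz = Integer.signum(cmp.compare(x, z));
                    // Rule 2: x > y and y > z must imply x > z.
                    if (xy > 0 && yz > 0 && xz <= 0) {
                        return true;
                    }
                    // Rule 3: if x equals y, both must compare the same way to z.
                    if (xy == 0 && xz != yz) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(1, 2, 3);
        Comparator<Integer> good = Integer::compare;
        // Deliberately broken: reports "greater" even when both values are equal.
        Comparator<Integer> bad = (a, b) -> a >= b ? 1 : -1;
        System.out.println("good violates contract: " + violatesContract(good, sample));
        System.out.println("bad violates contract:  " + violatesContract(bad, sample));
    }
}
```

A comparator can also break these rules without any coding mistake in its logic, simply because the values it reads change between two calls made during the same sort.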
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues using these resource usage values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the Global Scheduler async thread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and calls 
> submitResourceCommitRequest to add the CSAssignment to the backlog.
> ResourceCommitterService then calls tryCommit on this CSAssignment. Looking at 
> the tryCommit function, we can see that it updates the queue resource usage.
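
If the usage values read by the comparator can change while a sort is in progress, as the description above suggests, one general way to keep the ordering consistent is to sort on values that cannot change. The sketch below is not taken from the attached patches; `QueueUsage`, `Snapshot`, and `sortBySnapshot` are hypothetical stand-ins, and it assumes a single mutable usage figure per queue. It copies each queue's usage once into an immutable snapshot and sorts the snapshots.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class SnapshotSort {
    /** Hypothetical queue whose usage another thread may update at any time. */
    static final class QueueUsage {
        final String name;
        final AtomicLong usedMb = new AtomicLong();

        QueueUsage(String name, long usedMb) {
            this.name = name;
            this.usedMb.set(usedMb);
        }
    }

    /** Immutable copy of one queue's usage, read once before sorting. */
    static final class Snapshot {
        final QueueUsage queue;
        final long usedMb;

        Snapshot(QueueUsage queue) {
            this.queue = queue;
            this.usedMb = queue.usedMb.get();
        }
    }

    /** Returns the queues ordered by the usage seen at snapshot time, least used first. */
    static List<QueueUsage> sortBySnapshot(List<QueueUsage> queues) {
        List<Snapshot> snapshots = new ArrayList<>();
        for (QueueUsage q : queues) {
            snapshots.add(new Snapshot(q));
        }
        // The comparator only reads final fields, so its answers stay consistent.
        snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.usedMb));
        List<QueueUsage> ordered = new ArrayList<>();
        for (Snapshot s : snapshots) {
            ordered.add(s.queue);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<QueueUsage> queues = new ArrayList<>();
        queues.add(new QueueUsage("a", 300));
        queues.add(new QueueUsage("b", 100));
        queues.add(new QueueUsage("c", 200));
        for (QueueUsage q : sortBySnapshot(queues)) {
            System.out.println(q.name + " used " + q.usedMb.get() + " MB");
        }
    }
}
```

The trade-off is that the resulting order reflects usage at the moment the snapshots were taken, which may already be slightly stale by the time the order is used.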
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
>     boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>       (ResourceCommitRequest) r;
>
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
>     FiCaSchedulerApp app = getApplicationAttempt(attemptId);
>     // Required sanity check for attemptId - when async-scheduling enabled,
>     // proposal might be outdated if AM failover just finished
>     // and proposal queue was not be consumed in time
>     if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>       if (app.accept(cluster, request, updatePending)
>           && 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463335#comment-17463335
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], I will commit it today. Thanks!

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.2.1
> Reporter: tuyu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>