[ https://issues.apache.org/jira/browse/YARN-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007626#comment-18007626 ]
ASF GitHub Bot commented on YARN-11834:
---------------------------------------

shameersss1 commented on PR #7806:
URL: https://github.com/apache/hadoop/pull/7806#issuecomment-3079394722

   Thanks @zeekling for the review. @slfan1989 Could you please review the changes as well?


> [Capacity Scheduler] Application Stuck In ACCEPTED State due to Race Condition
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11834
>                 URL: https://issues.apache.org/jira/browse/YARN-11834
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> It was noted that in a Hadoop 3.4.1 YARN deployment, a Spark application was stuck in the ACCEPTED state even though the cluster had enough resources.
>
> *Steps to replicate*
> 1. Launch a YARN cluster with total capacity ≥ 1.59 TB memory and 660 vCores or more.
> 2. Apply the following properties:
> *capacity-scheduler*
> {{"yarn.scheduler.capacity.node-locality-delay": "-1",}}
> {{"yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",}}
> {{"yarn.scheduler.capacity.schedule-asynchronously.enable": "true"}}
> *yarn-site*
> {{"yarn.log-aggregation-enable": "true",}}
> {{"yarn.log-aggregation.retain-check-interval-seconds": "300",}}
> {{"yarn.log-aggregation.retain-seconds": "-1",}}
> {{"yarn.scheduler.capacity.max-parallel-apps": "1"}}
> 3. Submit multiple Spark jobs that launch a large number of containers. For example:
> {{spark-example --conf spark.dynamicAllocation.enabled=false --num-executors 2000 --driver-memory 1g --executor-memory 1g --executor-cores 1 SparkPi 1000}}
>
> *Observations*
> On analyzing the logs, the following were the observations:
> When Application 1 completes, there is a period during which its resource requests are still being processed or "honored" by the scheduler. During this transition period, the following sequence could occur:
> 1. Application 1 completes and releases its resources.
> 2. The scheduler is still processing some older allocation requests for Application 1.
> 3. During this processing, the *cul.canAssign flag* for the user is set to false. Refer to [Link #1|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1670] and [Link #2|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1268].
> 4. Application 2 (which is new) tries to get resources.
> 5. The scheduler checks the user's cul.canAssign flag, finds that it is false (due to the [cache implementation|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1241]), and denies resources to Application 2.
> 6. Application 2 remains in ACCEPTED state despite available resources.
> This race condition occurs because the user's resource usage state (tracked in the CapacityUsageLimit object) is not properly reset or synchronized between the completion of one application and the scheduling of another.
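To make the sequence above concrete, the following is a minimal, self-contained sketch of the caching pattern the observations describe; it is not the actual AbstractLeafQueue code. The CachedUserLimit name and its canAssign flag follow the "cul" object referenced above, and the "no user information is fetched" path follows the description of solution 1 below; AppAttempt, activeUserUsage, the simplified canAssignToUser, the concurrentRmEvents hook and all numbers are invented stand-ins for this illustration.

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of the per-pass user-limit cache race described above.
 * Simplified stand-ins only; not the CapacityScheduler implementation.
 */
public class UserLimitCacheRaceSketch {

  /** Per-scheduling-pass cached user-limit decision (the "cul" object). */
  static final class CachedUserLimit {
    final long userLimitMb;
    boolean canAssign = true;   // once false, later apps of the same user are skipped

    CachedUserLimit(long userLimitMb) {
      this.userLimitMb = userLimitMb;
    }
  }

  /** A scheduler-side view of an application attempt still present in the ordering policy. */
  static final class AppAttempt {
    final String appId;
    final String user;
    final boolean completed;    // true for a stale attempt of an already finished application

    AppAttempt(String appId, String user, boolean completed) {
      this.appId = appId;
      this.user = user;
      this.completed = completed;
    }
  }

  /** Stand-in for the queue's per-user bookkeeping: MB currently charged to each active user. */
  static final Map<String, Long> activeUserUsage = new LinkedHashMap<>();

  /** Simplified user-limit check: fails when the user is over the limit or its entry is gone. */
  static boolean canAssignToUser(String user, long userLimitMb) {
    Long usedMb = activeUserUsage.get(user);
    if (usedMb == null) {
      // "No user information is fetched": the entry was removed when the user's
      // last application completed, so the check answers false.
      return false;
    }
    return usedMb <= userLimitMb;
  }

  static void schedulingPass(List<AppAttempt> orderedApps, long userLimitMb,
      Runnable concurrentRmEvents) {
    // The cache lives only for this scheduling pass, keyed by user name.
    Map<String, CachedUserLimit> userLimits = new LinkedHashMap<>();

    for (AppAttempt app : orderedApps) {
      CachedUserLimit cul =
          userLimits.computeIfAbsent(app.user, u -> new CachedUserLimit(userLimitMb));

      if (!cul.canAssign) {
        // Step 5: the cached flag short-circuits the check, so the new
        // application is denied without its user ever being re-evaluated.
        System.out.println(app.appId + ": skipped, cached canAssign=false");
      } else if (!canAssignToUser(app.user, cul.userLimitMb)) {
        // Step 3: the stale attempt of the completed application poisons the
        // cache for every later app of the same user in this pass.
        cul.canAssign = false;
        System.out.println(app.appId + ": user check failed, canAssign cached as false");
      } else {
        System.out.println(app.appId + ": container assigned");
      }
      concurrentRmEvents.run();   // the RM event thread keeps making progress meanwhile
    }
  }

  public static void main(String[] args) {
    // Application 1 has finished and its user entry is gone, but a stale attempt of it
    // is still iterated ahead of the newly submitted Application 2 of the same user.
    List<AppAttempt> apps = List.of(
        new AppAttempt("application_1", "hadoop", true),
        new AppAttempt("application_2", "hadoop", false));

    // Concurrently, Application 2's activation re-registers the user with zero usage.
    Runnable concurrentRmEvents = () -> activeUserUsage.putIfAbsent("hadoop", 0L);

    schedulingPass(apps, 1_000_000, concurrentRmEvents);
    // application_1 fails the user check and caches canAssign=false; application_2 is
    // then skipped purely because of the cached flag, even though a fresh check at
    // that point would have succeeded -- so it stays in ACCEPTED.
  }
}
{code}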
>
> *Solutions*
> I can think of two solutions for this race condition:
> # *Cache Invalidation*: Invalidate the cache when no user information is fetched [here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1669]. By doing this, the new application (by the same user) will be forced to calculate new userLimits. The drawback of this approach is the repeated calculation of userLimits.
> # *Skip setting the cul.canAssign flag*: In this approach, setting the cul.canAssign flag is skipped if the application is already completed / removed from the applicationAttempt list - refer to [this|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractLeafQueue.java#L1267] code pointer.
>
> I am personally inclined to approach 2.
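Since the report leans toward approach 2, here is one way the idea can be expressed against the simplified sketch earlier in this message. This is only an illustration and is not taken from PR #7806: in AbstractLeafQueue the equivalent condition would be a check against the queue's application attempt list, whereas here it is the invented AppAttempt.completed field.

{code:java}
  // Drop-in replacement for schedulingPass() in the sketch above: a stale attempt of an
  // already-completed application is no longer allowed to cache canAssign=false for its user.
  static void schedulingPassApproach2(List<AppAttempt> orderedApps, long userLimitMb,
      Runnable concurrentRmEvents) {
    Map<String, CachedUserLimit> userLimits = new LinkedHashMap<>();

    for (AppAttempt app : orderedApps) {
      CachedUserLimit cul =
          userLimits.computeIfAbsent(app.user, u -> new CachedUserLimit(userLimitMb));

      if (!cul.canAssign) {
        System.out.println(app.appId + ": skipped, cached canAssign=false");
      } else if (!canAssignToUser(app.user, cul.userLimitMb)) {
        if (!app.completed) {
          // Only a live application may mark its user unassignable for the
          // remainder of this scheduling pass.
          cul.canAssign = false;
        }
        System.out.println(app.appId + ": user check failed"
            + (app.completed ? " (stale attempt, cache left untouched)" : ""));
      } else {
        System.out.println(app.appId + ": container assigned");
      }
      concurrentRmEvents.run();
    }
  }
{code}

Run against the same two-application scenario as the first sketch, application_2 now reaches its own user-limit check and is assigned, because the negative result produced on behalf of the completed application_1 is never cached.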