[ 
https://issues.apache.org/jira/browse/YARN-11809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17942051#comment-17942051
 ] 

ASF GitHub Bot commented on YARN-11809:
---------------------------------------

TaoYang526 opened a new pull request, #7589:
URL: https://github.com/apache/hadoop/pull/7589

   
   ### Description of PR
   
   Please refer to YARN-11809 for details.
   
   ### How was this patch tested?
   Unit tests (UT).
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> Support application backoff mechanism for CapacityScheduler
> -----------------------------------------------------------
>
>                 Key: YARN-11809
>                 URL: https://issues.apache.org/jira/browse/YARN-11809
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> Currently, when an application repeatedly fails to schedule tasks because of 
> resource constraints or other issues, it is still considered in every 
> scheduling cycle, which can add unnecessary scheduling overhead and resource 
> contention. The result can be inefficient resource utilization and higher 
> scheduling latency. The impact is greatest in global scheduling, where the 
> scheduler must consider resources across the entire cluster. The number of 
> containers allocated per second may drop from 1000+ to 200+ when the 
> scheduler is overwhelmed by repeated scheduling attempts for applications 
> that cannot be satisfied. 
> It is therefore necessary to introduce an application backoff mechanism in 
> the Capacity Scheduler that temporarily skips applications that fail to 
> schedule tasks after a certain number of opportunities, improving 
> scheduling efficiency.
> h2. Solution
> Implement an application backoff mechanism that:
>  * Tracks missed scheduling opportunities for each application
>  * Temporarily skips applications that exceed a configurable threshold of 
> missed opportunities
>  * Automatically resumes scheduling after a configurable backoff period
>  * Provides configurable parameters at both global and queue levels
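
The four behaviours listed above can be sketched roughly as follows. This is a minimal illustration only, not the code in the pull request: the class `AppBackoffSketch` and its method names are hypothetical, and it assumes backoff starts once the missed count reaches the threshold (the description does not say whether the comparison is "reaches" or "exceeds").

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the CapacityScheduler implementation. */
final class AppBackoffSketch {
    private final int missedThreshold;     // e.g. app-backoff.missed-threshold
    private final long backoffIntervalMs;  // e.g. app-backoff.interval-ms
    private final Map<String, Integer> missedCounts = new HashMap<>();
    private final Map<String, Long> backoffUntilMs = new HashMap<>();

    AppBackoffSketch(int missedThreshold, long backoffIntervalMs) {
        this.missedThreshold = missedThreshold;
        this.backoffIntervalMs = backoffIntervalMs;
    }

    /** Returns true while the application is inside its backoff window. */
    boolean shouldSkip(String appId, long nowMs) {
        Long until = backoffUntilMs.get(appId);
        if (until == null) {
            return false;
        }
        if (nowMs >= until) {
            // Backoff period elapsed: resume scheduling automatically.
            backoffUntilMs.remove(appId);
            return false;
        }
        return true;
    }

    /** Records one missed scheduling opportunity for the application. */
    void recordMissed(String appId, long nowMs) {
        int count = missedCounts.merge(appId, 1, Integer::sum);
        // Assumption: back off when the count reaches the threshold.
        if (count >= missedThreshold) {
            missedCounts.remove(appId);
            backoffUntilMs.put(appId, nowMs + backoffIntervalMs);
        }
    }

    /** A successful allocation clears the missed-opportunity count. */
    void recordAllocated(String appId) {
        missedCounts.remove(appId);
    }
}
```

In this sketch a scheduling loop would call `shouldSkip` before considering an application, `recordMissed` when nothing could be allocated for it, and `recordAllocated` on success. Resetting the count on a successful allocation is a design choice of the sketch, not something stated in the issue.
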
> h2. Configuration Parameters
> h3. Global Configuration
>  * yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff 
> mechanism globally (default: false)
>  * yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration 
> in milliseconds (default: 3000ms)
>  * yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of 
> missed opportunities before backoff (default: 3)
> h3. Queue-Specific Configuration
>  * yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable 
> the backoff mechanism for a specific queue. When enabled, applications in 
> this queue are temporarily skipped once they reach the missed-opportunities 
> threshold. Each queue can be configured independently, allowing fine-grained 
> control over which queues use the backoff mechanism. If not specified, the 
> queue inherits the global setting from 
> yarn.scheduler.capacity.app-backoff.enabled.
>  * yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff 
> duration in milliseconds for a specific queue. If not specified, it inherits 
> the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
>  * yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number 
> of missed opportunities before backoff for a specific queue. If not 
> specified, it inherits the global setting from 
> yarn.scheduler.capacity.app-backoff.missed-threshold.
> Queue-specific settings take precedence over global settings; a queue 
> inherits the global value for any setting it does not specify.
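
The precedence rule described above (queue-specific value, then global value, then the stated default) could be resolved along these lines. This is a hedged sketch over a plain `Map`, not the Hadoop `Configuration` API or the code in the pull request; the class `AppBackoffConfSketch`, its methods, and the queue paths used in the usage note are illustrative assumptions. Only the property names and defaults come from the description.

```java
import java.util.Map;

/** Hypothetical sketch of queue-over-global property resolution. */
final class AppBackoffConfSketch {
    static final String PREFIX = "yarn.scheduler.capacity.";
    static final String ENABLED = "app-backoff.enabled";
    static final String INTERVAL_MS = "app-backoff.interval-ms";
    static final String MISSED_THRESHOLD = "app-backoff.missed-threshold";

    private final Map<String, String> props;

    AppBackoffConfSketch(Map<String, String> props) {
        this.props = props;
    }

    /** Queue-specific value first, then the global value, then the default. */
    private String resolve(String queuePath, String suffix, String defaultValue) {
        String queueValue = props.get(PREFIX + queuePath + "." + suffix);
        if (queueValue != null) {
            return queueValue;
        }
        String globalValue = props.get(PREFIX + suffix);
        return globalValue != null ? globalValue : defaultValue;
    }

    boolean isEnabled(String queuePath) {
        return Boolean.parseBoolean(resolve(queuePath, ENABLED, "false"));
    }

    long intervalMs(String queuePath) {
        return Long.parseLong(resolve(queuePath, INTERVAL_MS, "3000"));
    }

    int missedThreshold(String queuePath) {
        return Integer.parseInt(resolve(queuePath, MISSED_THRESHOLD, "3"));
    }
}
```

For example, if the map sets `yarn.scheduler.capacity.app-backoff.enabled` to `true` and `yarn.scheduler.capacity.root.batch.app-backoff.interval-ms` to `10000`, then `intervalMs("root.batch")` yields 10000, while `intervalMs("root.default")` falls back to the default of 3000 and both queues report the mechanism as enabled.
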



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

