[
https://issues.apache.org/jira/browse/YARN-11809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated YARN-11809:
----------------------------------
Labels: pull-request-available (was: )
> Support application backoff mechanism for CapacityScheduler
> -----------------------------------------------------------
>
> Key: YARN-11809
> URL: https://issues.apache.org/jira/browse/YARN-11809
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Labels: pull-request-available
>
> Currently, when an application repeatedly fails to schedule tasks because of
> resource constraints or other issues, it is still considered in every
> scheduling cycle, which can add unnecessary scheduling overhead and resource
> contention. This can lead to inefficient resource utilization and increased
> scheduling latency. The impact is especially large in global scheduling,
> where the scheduler needs to consider resources across the entire cluster.
> The number of containers allocated per second may drop from 1000+ to 200+
> when the scheduler is overwhelmed by repeated scheduling attempts for
> applications that cannot be satisfied.
> It is therefore necessary to introduce an application backoff mechanism in
> the CapacityScheduler that temporarily skips applications that fail to
> schedule tasks after a certain number of opportunities, improving scheduling
> efficiency.
> h2. Solution
> Implement an application backoff mechanism that:
> * Tracks missed scheduling opportunities for each application
> * Temporarily skips applications that exceed a configurable threshold of
> missed opportunities
> * Automatically resumes scheduling after a configurable backoff period
> * Provides configurable parameters at both global and queue levels
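The four points above can be illustrated with a minimal sketch. It is not taken from the YARN-11809 patch: the class `AppBackoffTracker` and its methods are hypothetical names, time is passed in explicitly as milliseconds, and the sketch assumes that backoff starts once the missed-opportunity count reaches the threshold and that a successful allocation resets the count.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of per-application backoff; not the actual CapacityScheduler code. */
class AppBackoffTracker {
    private final int missedThreshold;    // missed opportunities before backoff starts
    private final long backoffIntervalMs; // how long an application is skipped

    // Application id -> missed opportunities since the last allocation or backoff.
    private final Map<String, Integer> missed = new HashMap<>();
    // Application id -> timestamp (ms) at which the application may be scheduled again.
    private final Map<String, Long> backoffUntil = new HashMap<>();

    AppBackoffTracker(int missedThreshold, long backoffIntervalMs) {
        this.missedThreshold = missedThreshold;
        this.backoffIntervalMs = backoffIntervalMs;
    }

    /** Returns true if the application is in backoff at time nowMs and should be skipped. */
    boolean shouldSkip(String appId, long nowMs) {
        Long until = backoffUntil.get(appId);
        if (until == null) {
            return false;
        }
        if (nowMs < until) {
            return true;
        }
        // Backoff period has elapsed: resume scheduling automatically.
        backoffUntil.remove(appId);
        return false;
    }

    /** Records one scheduling opportunity in which nothing could be allocated. */
    void recordMissed(String appId, long nowMs) {
        int count = missed.merge(appId, 1, Integer::sum);
        if (count >= missedThreshold) {
            // Assumption: reaching the threshold triggers backoff and restarts the count.
            backoffUntil.put(appId, nowMs + backoffIntervalMs);
            missed.remove(appId);
        }
    }

    /** Records a successful allocation, which clears any backoff state. */
    void recordAllocated(String appId) {
        missed.remove(appId);
        backoffUntil.remove(appId);
    }
}
```

In a scheduling loop, one would call `shouldSkip` before considering an application and `recordMissed` or `recordAllocated` afterwards, depending on the outcome. Thread safety is deliberately left out of this sketch.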
> h3. Configuration Parameters
> h3. Global Configuration
> * yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff
> mechanism globally (default: false)
> * yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration
> in milliseconds (default: 3000ms)
> * yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of
> missed opportunities before backoff (default: 3)
> h3. Queue-Specific Configuration
> * yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable
> backoff mechanism for a specific queue. When enabled, applications in this
> queue will be temporarily skipped if they fail to schedule tasks after
> reaching the missed opportunities threshold. This setting can be configured
> independently for each queue, allowing for fine-grained control over which
> queues use the backoff mechanism. If not specified, it inherits the global
> setting from yarn.scheduler.capacity.app-backoff.enabled.
> * yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff
> duration in milliseconds for a specific queue. If not specified, it inherits
> the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
> * yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number
> of missed opportunities before backoff for a specific queue. If not
> specified, it inherits the global setting from
> yarn.scheduler.capacity.app-backoff.missed-threshold.
> Queue-specific configurations take precedence over global configurations. If
> a queue-specific configuration is not set, the queue will inherit the global
> configuration values.
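The precedence rule above (queue-specific value, then global value, then default) could be sketched as follows. This is only an illustration: it uses a plain `Map` instead of Hadoop's configuration classes, and `AppBackoffSettings`, `resolve`, and `lookup` are hypothetical names. The property names and defaults are the ones listed in the description.

```java
import java.util.Map;

/** Hypothetical illustration of resolving backoff settings for one queue. */
class AppBackoffSettings {
    static final String PREFIX = "yarn.scheduler.capacity.";

    final boolean enabled;
    final long intervalMs;
    final int missedThreshold;

    private AppBackoffSettings(boolean enabled, long intervalMs, int missedThreshold) {
        this.enabled = enabled;
        this.intervalMs = intervalMs;
        this.missedThreshold = missedThreshold;
    }

    /** Resolves the effective settings for queuePath, for example "root.default". */
    static AppBackoffSettings resolve(Map<String, String> conf, String queuePath) {
        boolean enabled = Boolean.parseBoolean(
            lookup(conf, queuePath, "app-backoff.enabled", "false"));
        long intervalMs = Long.parseLong(
            lookup(conf, queuePath, "app-backoff.interval-ms", "3000"));
        int missedThreshold = Integer.parseInt(
            lookup(conf, queuePath, "app-backoff.missed-threshold", "3"));
        return new AppBackoffSettings(enabled, intervalMs, missedThreshold);
    }

    /** Queue-specific key first, then the global key, then the documented default. */
    private static String lookup(Map<String, String> conf, String queuePath,
                                 String suffix, String defaultValue) {
        String queueValue = conf.get(PREFIX + queuePath + "." + suffix);
        if (queueValue != null) {
            return queueValue;
        }
        String globalValue = conf.get(PREFIX + suffix);
        return globalValue != null ? globalValue : defaultValue;
    }
}
```

With only `yarn.scheduler.capacity.app-backoff.enabled=true` and `yarn.scheduler.capacity.root.default.app-backoff.interval-ms=5000` set, resolving `root.default` under this sketch would give an enabled backoff with a 5000 ms interval and the default threshold of 3.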