[
https://issues.apache.org/jira/browse/YARN-11809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Yang updated YARN-11809:
----------------------------
Description:
Currently, when an application repeatedly fails to schedule tasks due to
resource constraints or other issues, it continues to be considered in every
scheduling cycle, potentially causing unnecessary scheduling overhead and
resource contention. This can lead to inefficient resource utilization and
increased scheduling latency. This is especially impactful in global scheduling,
where the scheduler needs to consider resources across the entire cluster. The
number of containers allocated per second may drop from 1000+ to 200+ when the
scheduler is overwhelmed with repeated scheduling attempts for applications
that cannot be satisfied.
Thus it's necessary to introduce a new application backoff mechanism in the
Capacity Scheduler to temporarily skip applications that fail to schedule tasks
after a certain number of opportunities, improving the scheduling efficiency.
h2. Solution
Implement an application backoff mechanism that:
* Tracks missed scheduling opportunities for each application
* Temporarily skips applications that exceed a configurable threshold of
missed opportunities
* Automatically resumes scheduling after a configurable backoff period
* Provides configurable parameters at both global and queue levels
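The four points above can be sketched as a small per-application tracker. This is only an illustrative model, not the CapacityScheduler implementation: the class name AppBackoffTracker and its methods are hypothetical, and it assumes that backoff starts once the missed count reaches the threshold and that a successful allocation resets the counter.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AppBackoffTracker:
    """Hypothetical tracker; names are illustrative, not CapacityScheduler APIs."""
    missed_threshold: int = 3   # mirrors app-backoff.missed-threshold
    interval_ms: int = 3000     # mirrors app-backoff.interval-ms
    _missed: Dict[str, int] = field(default_factory=dict)
    _backoff_until: Dict[str, int] = field(default_factory=dict)

    def should_skip(self, app_id: str, now_ms: int) -> bool:
        """Return True while the application is inside its backoff window."""
        until = self._backoff_until.get(app_id)
        if until is None:
            return False
        if now_ms >= until:
            # Backoff period elapsed: scheduling resumes automatically.
            del self._backoff_until[app_id]
            return False
        return True

    def record_missed(self, app_id: str, now_ms: int) -> None:
        """Count a missed opportunity; start backoff at the threshold (assumption)."""
        count = self._missed.get(app_id, 0) + 1
        if count >= self.missed_threshold:
            self._backoff_until[app_id] = now_ms + self.interval_ms
            self._missed[app_id] = 0
        else:
            self._missed[app_id] = count

    def record_allocated(self, app_id: str) -> None:
        """Assumed behaviour: a successful allocation clears all backoff state."""
        self._missed.pop(app_id, None)
        self._backoff_until.pop(app_id, None)
```

In this model a scheduling loop would call should_skip before considering an application, and record_missed or record_allocated afterwards depending on the outcome.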
h2. Configuration Parameters
h3. Global Configuration
* yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff
mechanism globally (default: false)
* yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration in
milliseconds (default: 3000ms)
* yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of
missed opportunities before backoff (default: 3)
h3. Queue-Specific Configuration
* yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable
backoff mechanism for a specific queue. When enabled, applications in this
queue will be temporarily skipped if they fail to schedule tasks after reaching
the missed opportunities threshold. This setting can be configured
independently for each queue, allowing for fine-grained control over which
queues use the backoff mechanism. If not specified, it inherits the global
setting from yarn.scheduler.capacity.app-backoff.enabled.
* yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff
duration in milliseconds for a specific queue. If not specified, it inherits
the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
* yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number of
missed opportunities before backoff for a specific queue. If not specified, it
inherits the global setting from
yarn.scheduler.capacity.app-backoff.missed-threshold.
Queue-specific configurations take precedence over global configurations. If a
queue-specific configuration is not set, the queue will inherit the global
configuration values.
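The precedence rule can be illustrated with a small lookup helper. This is a sketch under stated assumptions, not the scheduler's configuration code: the resolve function and the queue paths root.batch and root.default are hypothetical, while the property names and defaults are the ones listed above.

```python
from typing import Mapping

PREFIX = "yarn.scheduler.capacity."

# Defaults as stated in the global configuration list above.
DEFAULTS = {
    "app-backoff.enabled": "false",
    "app-backoff.interval-ms": "3000",
    "app-backoff.missed-threshold": "3",
}


def resolve(conf: Mapping[str, str], queue_path: str, key: str) -> str:
    """Return the queue-specific value if set, else the global value, else the default."""
    queue_key = f"{PREFIX}{queue_path}.{key}"
    if queue_key in conf:
        return conf[queue_key]
    return conf.get(PREFIX + key, DEFAULTS[key])


# Hypothetical settings: backoff enabled globally, longer interval for one queue.
conf = {
    "yarn.scheduler.capacity.app-backoff.enabled": "true",
    "yarn.scheduler.capacity.root.batch.app-backoff.interval-ms": "10000",
}

batch_interval = resolve(conf, "root.batch", "app-backoff.interval-ms")
batch_enabled = resolve(conf, "root.batch", "app-backoff.enabled")
default_threshold = resolve(conf, "root.default", "app-backoff.missed-threshold")
```

Here root.batch uses its own 10000 ms interval but inherits the global enabled flag, while root.default falls back to the default threshold because neither level sets one.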
was:
Currently, when an application repeatedly fails to schedule tasks due to
resource constraints or other issues, it continues to be considered in every
scheduling cycle, potentially causing unnecessary scheduling overhead and
resource contention. This can lead to inefficient resource utilization and
increased scheduling latency. This is especially impactful in global scheduling
where the scheduler needs to consider resources across the entire cluster. The
number of allocated containers per second may drop from 1000+ to 200+, when the
scheduler is overwhelmed with repeated scheduling attempts for applications
that cannot be satisfied.
Thus it's necessary to introduce a new application backoff mechanism in the
Capacity Scheduler to temporarily skip applications that fail to schedule tasks
after a certain number of opportunities, improving the scheduling efficiency.
h2. Solution
Implement an application backoff mechanism that:
# Tracks missed scheduling opportunities for each application
# Temporarily skips applications that exceed a configurable threshold of
missed opportunities
# Automatically resumes scheduling after a configurable backoff period
# Provides configurable parameters at both global and queue levels
h3. Configuration Parameters
h3. Global Configuration
* yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff
mechanism globally (default: false)
* yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration in
milliseconds (default: 3000ms)
* yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of
missed opportunities before backoff (default: 3)
h3. Queue-Specific Configuration
* yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable
backoff mechanism for a specific queue. When enabled, applications in this
queue will be temporarily skipped if they fail to schedule tasks after reaching
the missed opportunities threshold. This setting can be configured
independently for each queue, allowing for fine-grained control over which
queues use the backoff mechanism. If not specified, it inherits the global
setting from yarn.scheduler.capacity.app-backoff.enabled.
* yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff
duration in milliseconds for a specific queue. If not specified, it inherits
the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
* yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number of
missed opportunities before backoff for a specific queue. If not specified, it
inherits the global setting from
yarn.scheduler.capacity.app-backoff.missed-threshold.
Queue-specific configurations take precedence over global configurations. If a
queue-specific configuration is not set, the queue will inherit the global
configuration values.
> Support application backoff mechanism for CapacityScheduler
> -----------------------------------------------------------
>
> Key: YARN-11809
> URL: https://issues.apache.org/jira/browse/YARN-11809
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
>
> Currently, when an application repeatedly fails to schedule tasks due to
> resource constraints or other issues, it continues to be considered in every
> scheduling cycle, potentially causing unnecessary scheduling overhead and
> resource contention. This can lead to inefficient resource utilization and
> increased scheduling latency. This is especially impactful in global
> scheduling, where the scheduler needs to consider resources across the entire
> cluster. The number of containers allocated per second may drop from 1000+ to
> 200+ when the scheduler is overwhelmed with repeated scheduling attempts for
> applications that cannot be satisfied.
> Thus it's necessary to introduce a new application backoff mechanism in the
> Capacity Scheduler to temporarily skip applications that fail to schedule
> tasks after a certain number of opportunities, improving the scheduling
> efficiency.
> h2. Solution
> Implement an application backoff mechanism that:
> * Tracks missed scheduling opportunities for each application
> * Temporarily skips applications that exceed a configurable threshold of
> missed opportunities
> * Automatically resumes scheduling after a configurable backoff period
> * Provides configurable parameters at both global and queue levels
> h2. Configuration Parameters
> h3. Global Configuration
> * yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff
> mechanism globally (default: false)
> * yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration
> in milliseconds (default: 3000ms)
> * yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of
> missed opportunities before backoff (default: 3)
> h3. Queue-Specific Configuration
> * yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable
> backoff mechanism for a specific queue. When enabled, applications in this
> queue will be temporarily skipped if they fail to schedule tasks after
> reaching the missed opportunities threshold. This setting can be configured
> independently for each queue, allowing for fine-grained control over which
> queues use the backoff mechanism. If not specified, it inherits the global
> setting from yarn.scheduler.capacity.app-backoff.enabled.
> * yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff
> duration in milliseconds for a specific queue. If not specified, it inherits
> the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
> * yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number
> of missed opportunities before backoff for a specific queue. If not
> specified, it inherits the global setting from
> yarn.scheduler.capacity.app-backoff.missed-threshold.
> Queue-specific configurations take precedence over global configurations. If
> a queue-specific configuration is not set, the queue will inherit the global
> configuration values.