Tao Yang created YARN-11809:
-------------------------------

             Summary: Support application backoff mechanism for CapacityScheduler
                 Key: YARN-11809
                 URL: https://issues.apache.org/jira/browse/YARN-11809
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Tao Yang
            Assignee: Tao Yang


Currently, when an application repeatedly fails to schedule tasks due to 
resource constraints or other issues, it is still considered in every 
scheduling cycle, which adds unnecessary scheduling overhead and resource 
contention. This can lead to inefficient resource utilization and increased 
scheduling latency. The impact is greatest in global scheduling, where the 
scheduler must consider resources across the entire cluster: when the 
scheduler is overwhelmed with repeated scheduling attempts for applications 
that cannot be satisfied, the number of containers allocated per second may 
drop from 1000+ to 200+.

It is therefore necessary to introduce an application backoff mechanism in the 
Capacity Scheduler that temporarily skips applications that fail to schedule 
tasks after a certain number of opportunities, improving scheduling efficiency.

h2. Solution

Implement an application backoff mechanism that:
 # Tracks missed scheduling opportunities for each application

 # Temporarily skips applications that exceed a configurable threshold of 
missed opportunities

 # Automatically resumes scheduling after a configurable backoff period

 # Provides configurable parameters at both global and queue levels
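
The first three points can be sketched as a small per-application tracker. 
This is only an illustration of the idea, not the code proposed for 
YARN-11809: the class AppBackoffTracker, its method names, and the choice to 
start the backoff once the missed count reaches the threshold are all 
assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the YARN code base. */
class AppBackoffTracker {
    /** Per-application bookkeeping. */
    private static final class State {
        int missed;            // consecutive missed scheduling opportunities
        boolean backingOff;    // true while the app is being skipped
        long backoffUntilMs;   // end of the current backoff window
    }

    private final int missedThreshold;
    private final long intervalMs;
    private final Map<String, State> states = new HashMap<>();

    AppBackoffTracker(int missedThreshold, long intervalMs) {
        this.missedThreshold = missedThreshold;
        this.intervalMs = intervalMs;
    }

    /** Returns true while the application is inside its backoff window. */
    boolean shouldSkip(String appId, long nowMs) {
        State s = states.get(appId);
        if (s == null || !s.backingOff) {
            return false;
        }
        if (nowMs < s.backoffUntilMs) {
            return true;
        }
        // The window has elapsed: resume scheduling and start counting afresh.
        s.backingOff = false;
        s.missed = 0;
        return false;
    }

    /** Records one scheduling opportunity in which nothing was allocated. */
    void recordMissed(String appId, long nowMs) {
        State s = states.computeIfAbsent(appId, k -> new State());
        s.missed++;
        // Assumption: backoff starts when the count reaches the threshold.
        if (s.missed >= missedThreshold) {
            s.backingOff = true;
            s.backoffUntilMs = nowMs + intervalMs;
        }
    }

    /** A successful allocation clears all backoff state for the application. */
    void recordAllocated(String appId) {
        states.remove(appId);
    }
}
```

In this sketch a scheduling loop would call shouldSkip before considering an 
application, recordMissed when an attempt allocates nothing, and 
recordAllocated on success. Timestamps are passed in explicitly so the 
behaviour can be checked without waiting on a real clock.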

h2. Configuration Parameters
h3. Global Configuration
 * yarn.scheduler.capacity.app-backoff.enabled: Enable/disable backoff 
mechanism globally (default: false)

 * yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration in 
milliseconds (default: 3000ms)

 * yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of 
missed opportunities before backoff (default: 3)

h3. Queue-Specific Configuration
 * yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable 
the backoff mechanism for a specific queue. When enabled, applications in this 
queue are temporarily skipped once they reach the missed-opportunities 
threshold without scheduling any tasks. Each queue can be configured 
independently, giving fine-grained control over which queues use the backoff 
mechanism. If not specified, it inherits the global setting from 
yarn.scheduler.capacity.app-backoff.enabled.

 * yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff 
duration in milliseconds for a specific queue. If not specified, it inherits 
the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.

 * yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number of 
missed opportunities before backoff for a specific queue. If not specified, it 
inherits the global setting from 
yarn.scheduler.capacity.app-backoff.missed-threshold.

Queue-specific settings take precedence over global settings. Any parameter 
that is not set for a queue falls back to the corresponding global value.
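
The precedence rule can be illustrated with a small lookup helper. This is a 
sketch under stated assumptions: the property names and defaults are the ones 
listed above, but the class AppBackoffConf and its methods are hypothetical, 
and a plain Map stands in for the scheduler's real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain Map stands in for the scheduler configuration. */
class AppBackoffConf {
    static final String PREFIX = "yarn.scheduler.capacity.";

    private final Map<String, String> props;

    AppBackoffConf(Map<String, String> props) {
        this.props = props;
    }

    /** Queue-level value first, then the global value, then the built-in default. */
    private String lookup(String queuePath, String name, String defaultValue) {
        String queueValue = props.get(PREFIX + queuePath + ".app-backoff." + name);
        if (queueValue != null) {
            return queueValue;
        }
        String globalValue = props.get(PREFIX + "app-backoff." + name);
        return globalValue != null ? globalValue : defaultValue;
    }

    boolean isEnabled(String queuePath) {
        return Boolean.parseBoolean(lookup(queuePath, "enabled", "false"));
    }

    long intervalMs(String queuePath) {
        return Long.parseLong(lookup(queuePath, "interval-ms", "3000"));
    }

    int missedThreshold(String queuePath) {
        return Integer.parseInt(lookup(queuePath, "missed-threshold", "3"));
    }
}
```

For example, with yarn.scheduler.capacity.app-backoff.enabled set to true and 
yarn.scheduler.capacity.root.batch.app-backoff.interval-ms set to 10000 (the 
queue root.batch is made up for this example), intervalMs("root.batch") 
returns 10000, while any other queue gets the default of 3000 and every queue 
inherits enabled=true.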


