Tao Yang created YARN-11809:
-------------------------------

             Summary: Support application backoff mechanism for CapacityScheduler
                 Key: YARN-11809
                 URL: https://issues.apache.org/jira/browse/YARN-11809
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Tao Yang
            Assignee: Tao Yang

Currently, when an application repeatedly fails to schedule tasks because of resource constraints or other issues, it is still considered in every scheduling cycle, which can cause unnecessary scheduling overhead and resource contention. This can lead to inefficient resource utilization and increased scheduling latency.

This is especially impactful in global scheduling, where the scheduler needs to consider resources across the entire cluster. The number of containers allocated per second may drop from 1000+ to 200+ when the scheduler is overwhelmed by repeated scheduling attempts for applications that cannot be satisfied.

Thus it is necessary to introduce an application backoff mechanism in the Capacity Scheduler that temporarily skips applications that fail to schedule tasks after a certain number of opportunities, improving scheduling efficiency.

h2. Solution

Implement an application backoff mechanism that:
# Tracks missed scheduling opportunities for each application
# Temporarily skips applications that exceed a configurable threshold of missed opportunities
# Automatically resumes scheduling after a configurable backoff period
# Provides configurable parameters at both global and queue levels

h3. Configuration Parameters

h3. Global Configuration

* yarn.scheduler.capacity.app-backoff.enabled: Enable/disable the backoff mechanism globally (default: false)
* yarn.scheduler.capacity.app-backoff.interval-ms: Global backoff duration in milliseconds (default: 3000ms)
* yarn.scheduler.capacity.app-backoff.missed-threshold: Global number of missed opportunities before backoff (default: 3)

h3. Queue-Specific Configuration

* yarn.scheduler.capacity.<queue-path>.app-backoff.enabled: Enable/disable the backoff mechanism for a specific queue. When enabled, applications in this queue will be temporarily skipped if they fail to schedule tasks after reaching the missed opportunities threshold.
This setting can be configured independently for each queue, allowing fine-grained control over which queues use the backoff mechanism. If not specified, it inherits the global setting from yarn.scheduler.capacity.app-backoff.enabled.
* yarn.scheduler.capacity.<queue-path>.app-backoff.interval-ms: Backoff duration in milliseconds for a specific queue. If not specified, it inherits the global setting from yarn.scheduler.capacity.app-backoff.interval-ms.
* yarn.scheduler.capacity.<queue-path>.app-backoff.missed-threshold: Number of missed opportunities before backoff for a specific queue. If not specified, it inherits the global setting from yarn.scheduler.capacity.app-backoff.missed-threshold.

Queue-specific configurations take precedence over global configurations. If a queue-specific configuration is not set, the queue will inherit the global configuration values.
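
To make the proposal concrete, here is a minimal, self-contained Java sketch of the behaviour described above: queue-level settings falling back to global settings, and a per-application counter that triggers a timed backoff. It is not the actual patch and does not use any Hadoop classes. The class name AppBackoffSketch, its methods, and the queue path root.batch are hypothetical. It also assumes details the description does not settle: backoff starts once the missed count reaches the threshold, and the counter resets on a successful allocation or when a backoff starts.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not the YARN-11809 patch. */
class AppBackoffSketch {
    static final String PREFIX = "yarn.scheduler.capacity.";

    /** Returns the queue-specific value if set, else the global value, else the default. */
    static String resolve(Map<String, String> conf, String queuePath, String name, String dflt) {
        String queueValue = conf.get(PREFIX + queuePath + ".app-backoff." + name);
        if (queueValue != null) {
            return queueValue;
        }
        return conf.getOrDefault(PREFIX + "app-backoff." + name, dflt);
    }

    private final int missedThreshold;
    private final long intervalMs;
    private final Map<String, Integer> missed = new HashMap<>();
    private final Map<String, Long> backoffUntil = new HashMap<>();

    AppBackoffSketch(int missedThreshold, long intervalMs) {
        this.missedThreshold = missedThreshold;
        this.intervalMs = intervalMs;
    }

    /** True while the application is inside its backoff window. */
    boolean shouldSkip(String appId, long nowMs) {
        Long until = backoffUntil.get(appId);
        if (until == null) {
            return false;
        }
        if (nowMs < until) {
            return true;
        }
        backoffUntil.remove(appId); // window elapsed: resume scheduling
        return false;
    }

    /** Records one missed opportunity; assumed to start backoff once the threshold is reached. */
    void recordMissed(String appId, long nowMs) {
        int count = missed.merge(appId, 1, Integer::sum);
        if (count >= missedThreshold) {
            backoffUntil.put(appId, nowMs + intervalMs);
            missed.remove(appId); // assumption: counting restarts after each backoff
        }
    }

    /** Assumption: a successful allocation clears the missed counter. */
    void recordAllocated(String appId) {
        missed.remove(appId);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PREFIX + "app-backoff.interval-ms", "3000");
        conf.put(PREFIX + "root.batch.app-backoff.missed-threshold", "5");

        // The threshold comes from the queue, the interval from the global setting.
        int threshold = Integer.parseInt(resolve(conf, "root.batch", "missed-threshold", "3"));
        long interval = Long.parseLong(resolve(conf, "root.batch", "interval-ms", "3000"));

        AppBackoffSketch tracker = new AppBackoffSketch(threshold, interval);
        for (int i = 0; i < threshold; i++) {
            tracker.recordMissed("app_1", 1000L);
        }
        System.out.println("skip at t=2000: " + tracker.shouldSkip("app_1", 2000L));
        System.out.println("skip at t=4000: " + tracker.shouldSkip("app_1", 4000L));
    }
}
```

In a real scheduler the check would presumably sit where the queue iterates over its applications, so that an application in backoff is skipped before any allocation work is done for it. The sketch passes the clock in explicitly only to keep the example deterministic.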