[Yahoo-eng-team] [Bug 2089386] [NEW] [RFE] Add Distributed Locking for Host Discovery Operations in Multi-Scheduler Environments

Serhii Ivanov Fri, 22 Nov 2024 06:01:18 -0800

Public bug reported:

Add Distributed Locking for Host Discovery Operations in Multi-Scheduler
Environments


Host discovery operations in Nova are currently vulnerable to race conditions
and concurrent execution issues, particularly in production environments where
multiple Nova schedulers run simultaneously for high availability/redundancy.
In such deployments:
- All schedulers share the same database backend
- Each scheduler runs its own periodic automatic host discovery task
- Cron jobs run `nova-manage cell_v2 discover_hosts` periodically on the same
hosts as the schedulers

Current symptoms (due to overlapping host discovery tasks):
- Potentially frequent host discovery failures, with missed or incomplete
host discoveries
- Error messages about duplicate host mappings
- Database conflicts when multiple processes try to map the same hosts 
simultaneously

Proposed Solution: Implement an opt-in distributed locking mechanism for host 
discovery operations to ensure that CLI and periodic automatic host discovery 
tasks run sequentially. The solution should:
1. Be opt-in, enabled via a config option
2. Use a distributed lock (leveraging tooz.coordination) before initiating any 
host discovery operation
3. Support coordination across:
   - Scheduler automatic host discovery task
   - `nova-manage cell_v2 discover_hosts` command
4. Extend Nova configuration with an additional config option for defining the
coordinator URI
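
The points above could be sketched roughly as follows. This is only an
illustration of the idea, not a proposed patch: the helper name
`discovery_lock`, the lock name `nova-host-discovery`, and the `backend_url`
argument (standing in for the new coordinator URI option) are all
hypothetical. The tooz calls used (`coordination.get_coordinator`, `start`,
`get_lock`, `stop`) are the standard tooz coordination API; tooz is imported
lazily so the sketch does nothing unless a URI is configured.

```python
import contextlib
import uuid


@contextlib.contextmanager
def discovery_lock(backend_url, lock_name=b"nova-host-discovery",
                   get_coordinator=None):
    """Hold a distributed lock while a host discovery run is in progress.

    Yields True if the lock is held, or False if locking is disabled
    because no coordinator URI is configured (the opt-in case).
    """
    if not backend_url:
        # Opt-in: with no coordinator URI, behave exactly as today.
        yield False
        return

    if get_coordinator is None:
        # Lazy import, so tooz is only needed when locking is enabled.
        from tooz import coordination
        get_coordinator = coordination.get_coordinator

    # tooz expects a unique member id (bytes) per participating process.
    member_id = uuid.uuid4().hex.encode()
    coordinator = get_coordinator(backend_url, member_id)
    coordinator.start(start_heart=True)
    try:
        # A tooz lock is a context manager: it blocks until acquired
        # and is released when the block exits.
        with coordinator.get_lock(lock_name):
            yield True
    finally:
        coordinator.stop()


def discover_hosts():
    """Placeholder for the real discovery logic (hypothetical)."""
    return []


# Both the scheduler periodic task and the nova-manage command would wrap
# their discovery call the same way; an empty URI disables locking.
with discovery_lock(""):
    discover_hosts()
```

Because every caller requests the same lock name from the same coordinator
backend, the periodic task and the CLI command would run one at a time,
whichever host they start on.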

Benefits:
- Prevents race conditions during host discovery across all scenarios
- Removes the need for complex external scheduling and coordination of
discovery jobs in high availability/redundancy setups
- Reduces operational overhead by eliminating manual conflict resolution

The solution should be configurable and work across different Nova
deployments without requiring additional external dependencies beyond
what Nova already uses for coordination. This will greatly benefit
highly available, large-scale deployments with multiple schedulers and
automated host discovery operations.

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: rfe


https://bugs.launchpad.net/bugs/2089386
