Public bug reported:

The original problem, observed in a downstream deployment, was
overcommit on dedicated PCPUs and a CPUPinningInvalid exception breaking
the update_available_resource periodic job.
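
(As a toy illustration of that failure mode, with entirely made-up
names, not Nova code: dedicated PCPUs are claimed from a finite pool, so
re-applying a stale migration context effectively claims the same PCPUs
twice, and the next claim fails much like CPUPinningInvalid breaks the
periodic job.)

    # Toy model, purely illustrative; no Nova data structures involved.
    free, pinned = {0, 1, 2, 3}, set()

    def claim(cpus):
        if not cpus <= free:
            # Nova raises CPUPinningInvalid in the analogous situation.
            raise Exception('already pinned: %s' % (cpus & pinned))
        free.difference_update(cpus)
        pinned.update(cpus)

    claim({0, 1})  # the instance's legitimate claim
    claim({0, 1})  # a stale migration context claims again -> boom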

The following reproduction is not end-to-end, but I hope it
demonstrates where things go wrong.

The environment is a multi-node devstack:
devstack0 - all-in-one
devstack0a - compute

Nova is backed by libvirt/qemu/kvm.

devstack 6b0f055b
nova on devstack0 39f560d673
nova on devstack0a a72f7eaac7
libvirt 8.0.0-1ubuntu7.8
qemu 1:6.2+dfsg-2ubuntu6.16
linux 5.15.0-91-generic

# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete

# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
openstack server create --flavor cirros256-pinned --image cirros-0.6.2-x86_64-disk --boot-from-volume 1 --nic net-id=private --availability-zone :devstack0 vm0 --wait

# Start a live migration to devstack0a, but simulate a failure. In my environment a complete live migration takes around 20 seconds. Using 'sleep 3' it usually breaks in the 'preparing' status.
# As far as I understand, other kinds of migration (like cold migration) are also affected.
openstack server migrate --live-migration vm0 --wait & sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu

$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | preparing | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 |         11 |         11 | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:41:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 |         11 |         11 | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# Wait before restarting n-cpu on devstack0a. I don't fully understand what determines whether the migration finally ends up in 'failed' or in 'error' status. Currently it seems that if I restart n-cpu too quickly, the migration goes to 'failed' right after the restart. But to reproduce this bug it has to stay in 'error'. So far, restarting only after 10+ minutes has always left the migration in 'error' status.
sleep 600 ; ssh devstack0a sudo systemctl start devstack@n-cpu

# The server gets stuck in MIGRATING status.
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
MIGRATING

# We reset it to ACTIVE. While at the moment I don't have an at-will upstream reproduction without 'set --state', I believe the same end problem was observed in downstream environments without ever resorting to 'set --state'.
$ openstack server set --state active vm0
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
ACTIVE

# Start following n-cpu logs on devstack0a.
sudo journalctl -u devstack@n-cpu.service -f | egrep --line-buffered 'ERROR|Applying migration context for instance'

# Try to live migrate the same server again, this time letting it finish. It completes successfully. (I have occasionally observed the 2nd migration fail with "nova.exception.DestinationDiskExists: The supplied disk path (/opt/stack/data/nova/instances/19700c16-88d9-416c-9fb8-b79bea7925ee) already exists, it is expected not to exist." My best guess is that I did not give n-cpu on devstack0a enough time to clean up after its restart. Most of the time this error does not happen.)
openstack server migrate --live-migration vm0 --wait

# However, in the logs of n-cpu on devstack0a we see that the update_available_resource periodic job applies the previously failed migration's context and keeps applying it every minute.
$ sudo journalctl -u devstack@n-cpu.service -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
jan 29 12:57:37 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:58:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:59:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
...

# For reference, we have these migrations at the end.
rubasov@devstack0:~/src/os/openstack/devstack$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 |         11 |         11 | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
| 34 | 20d7f349-2d5a-4796-bedc-8bc6e06134cf | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | completed | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 |         11 |         11 | live-migration | 2024-01-29T12:56:57.000000 | 2024-01-29T12:57:10.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

# The relevant code.
https://opendev.org/openstack/nova/src/commit/6531ed6310c4c11ee807e10d1a0fa91658c3afea/nova/compute/resource_tracker.py#L995-L1024
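
From my reading of that block (a simplified paraphrase with names of my
own, not a verbatim quote; see the link above for the authoritative
version), the migration list queried there deliberately includes
status=error migrations, and the context is applied for any paired
incoming migration regardless of whether it is stale:

    # Simplified paraphrase of the linked _update_available_resource lines.
    for migration in migrations:  # the query includes status=error rows
        instance = instance_by_uuid.get(migration.instance_uuid)
        if instance is None or instance.migration_context is None:
            continue
        # The status is only logged here, it is not used to skip a dead
        # migration, hence the repeated log lines above.
        LOG.debug('Applying migration context for instance %s as it has an '
                  'incoming, in-progress migration %s. Migration status is '
                  '%s', instance.uuid, migration.uuid, migration.status)
        instance.apply_migration_context()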

I don't yet see how this should be fixed. Maybe a migration should
never be stuck indefinitely in status=error; to me status=failed seems
to be the proper final state. Maybe we should not apply a status=error
migration context when there is a successfully repeated migration of
the same server to the same destination host. Or at least we should not
apply the status=error migration context in every
update_available_resource call. Maybe we should also do some cleanup
when resetting the server state; however, I believe the problem can
appear without ever resetting the state, so that would not be a generic
fix. Let me know what you think.
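
To make the second idea above concrete, here is a rough, untested
sketch (the helper and its wiring are mine, not existing Nova code):

    def _superseded_error_migration(migration, migrations):
        # Hypothetical filter: ignore a status=error migration when a newer
        # migration of the same server to the same destination has already
        # completed, so its stale context is not applied anymore.
        return (migration.status == 'error' and
                any(m.instance_uuid == migration.instance_uuid
                    and m.dest_compute == migration.dest_compute
                    and m.status == 'completed'
                    and m.created_at > migration.created_at
                    for m in migrations))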

Also, please let me know if this is sufficient to demonstrate that
there is a bug here. If needed, I can work on extending the
reproduction to re-create the actual overcommit problems, but I hope
applying the wrong migration context already shows we have a problem,
and I can spare doing that work.

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: resource-tracker

https://bugs.launchpad.net/bugs/2051685

Title:
  After repeating an incomplete migration, nova applies the wrong
  (status=error) migration context in the update_available_resource
  periodic job
