Public bug reported:

We run OpenStack 2023.1, deployed via kolla.
After upgrading from Zed to 2023.1, we can no longer migrate several instances
that have PCI devices attached (Nvidia T4 GPUs).

nova-scheduler raises this exception during PCI filtering:


Exception during message handling: TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 224, in select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._select_destinations(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 251, in _select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._schedule(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 388, in _schedule
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     hosts = self._get_sorted_hosts(spec_obj, hosts, num)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 672, in _get_sorted_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     filtered_hosts = self.host_manager.get_filtered_hosts(host_states,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/host_manager.py", line 617, in get_filtered_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.filter_handler.get_filtered_objects(self.enabled_filters,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 89, in get_filtered_objects
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     list_objs = list(objs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 44, in filter_all
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if self._filter_one(obj, spec_obj):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 51, in _filter_one
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.host_passes(obj, spec)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 60, in host_passes
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     good_candidates = self.filter_candidates(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 81, in filter_candidates
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if filter_func(candidate):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 62, in <lambda>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     lambda candidate: host_state.pci_stats.support_requests(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 775, in support_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     stats.apply_requests(requests, provider_mapping, numa_cells)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 907, in apply_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     rp_uuids = self._get_rp_uuids_for_request(provider_mapping, r)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 871, in _get_rp_uuids_for_request
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return [
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 874, in <listcomp>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if group_id.startswith(request.request_id)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server


The problematic code lies here:
https://opendev.org/openstack/nova/src/commit/1d788f11890385658f9485a281c1beeede94a830/nova/pci/stats.py#L874
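
To illustrate the failure mode, here is a minimal standalone sketch, not the actual nova code. The FakeRequest class, the shape of provider_mapping, and the guarded variant are assumptions made for illustration; only the startswith(request.request_id) condition is taken from the traceback. Whether returning an empty list is the right behaviour for a request without an ID is a question for the nova maintainers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a PCI request; only request_id matters here."""
    request_id: Optional[str]


def rp_uuids_unguarded(
    provider_mapping: Dict[str, List[str]], request: FakeRequest
) -> List[str]:
    # Loosely modelled on the comprehension in the traceback: a None
    # request_id makes str.startswith raise TypeError.
    return [
        rp_uuids[0]
        for group_id, rp_uuids in provider_mapping.items()
        if group_id.startswith(request.request_id)
    ]


def rp_uuids_guarded(
    provider_mapping: Dict[str, List[str]], request: FakeRequest
) -> List[str]:
    # Assumption: a request without an ID cannot match any request group,
    # so return no resource providers instead of raising.
    if request.request_id is None:
        return []
    return rp_uuids_unguarded(provider_mapping, request)
```

With a mapping such as {"abc-0": ["rp1"]}, the unguarded version raises the same TypeError for FakeRequest(None), while the guarded version returns an empty list.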

In some cases, request_id was never populated for instances with PCI
devices:

MariaDB [nova]> select instance_uuid, request_id from pci_devices;
+--------------------------------------+--------------------------------------+
| instance_uuid                        | request_id                           |
+--------------------------------------+--------------------------------------+
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| deeafa5e-86a4-4e4e-9172-574d0a3629fc | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| 80967831-104b-4619-9415-f819e458b307 | NULL                                 |
| d9701926-2e83-4cab-9e37-54ffa0309a22 | NULL                                 |
| af160e3d-a4aa-418f-b9ff-eaa20ec1d947 | c277bea1-5c4c-40d1-812f-f8c680689214 |
| 1dbf0831-e4a6-4073-b501-ce9d9d598937 | ed6eab10-b0e6-48b8-be60-dad8c0553c8b |


The following queries show that, for an affected instance, request_id is
either missing or set to null:
[nova] select pci_requests from instance_extra where instance_uuid='<INSTANCE_UUID>' \G;
[nova_api] select spec from request_specs where instance_uuid='<INSTANCE_UUID>' \G;


Freshly spawned instances do not suffer from a missing PCI request_id.
Some of the problematic instances are old, spawned during the Train release.
Instances spawned during the Zed release have request_id set and can be
migrated.

We are able to work around this issue by adding a newly generated
request_id to the corresponding tables.
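
As a rough sketch of what such a workaround could look like, the helper below fills in missing request_id values in a serialized list of PCI requests. It is an illustration only: the report does not show the column contents, so the assumption that pci_requests is a JSON list of objects with a request_id key is unverified, and the function name fill_missing_request_ids is hypothetical. It does not touch the database, and it does not cover keeping the value consistent across pci_devices, instance_extra and request_specs.

```python
import json
import uuid
from typing import Tuple


def fill_missing_request_ids(pci_requests_json: str) -> Tuple[str, int]:
    """Return the patched JSON blob and the number of entries changed.

    Assumes (unverified) that the blob is a JSON list of dicts, each of
    which may carry a "request_id" key that is absent or null.
    """
    requests = json.loads(pci_requests_json)
    patched = 0
    for req in requests:
        if req.get("request_id") is None:
            # Generate a fresh random UUID, as described in the workaround.
            req["request_id"] = str(uuid.uuid4())
            patched += 1
    return json.dumps(requests), patched
```

Entries that already carry a request_id are left unchanged, so running the helper repeatedly on the same blob does not generate further IDs. Back up the affected rows before writing anything back.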

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: migration

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2084238

Title:
  Cold-Migration fails when pci_request has nulled request_id

Status in OpenStack Compute (nova):
  New
