Public bug reported:

We run OpenStack 2023.1, deployed via kolla. After upgrading from Zed to 2023.1, we are unable to migrate several instances that have PCI devices attached (Nvidia T4 GPUs).
nova-scheduler throws this exception during PCI filtering:

Exception during message handling: TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 224, in select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._select_destinations(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 251, in _select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._schedule(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 388, in _schedule
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     hosts = self._get_sorted_hosts(spec_obj, hosts, num)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 672, in _get_sorted_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     filtered_hosts = self.host_manager.get_filtered_hosts(host_states,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/host_manager.py", line 617, in get_filtered_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.filter_handler.get_filtered_objects(self.enabled_filters,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 89, in get_filtered_objects
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     list_objs = list(objs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 44, in filter_all
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if self._filter_one(obj, spec_obj):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 51, in _filter_one
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.host_passes(obj, spec)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 60, in host_passes
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     good_candidates = self.filter_candidates(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 81, in filter_candidates
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if filter_func(candidate):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 62, in <lambda>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     lambda candidate: host_state.pci_stats.support_requests(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 775, in support_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     stats.apply_requests(requests, provider_mapping, numa_cells)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 907, in apply_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     rp_uuids = self._get_rp_uuids_for_request(provider_mapping, r)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 871, in _get_rp_uuids_for_request
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return [
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 874, in <listcomp>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if group_id.startswith(request.request_id)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server

The problematic code lies here:
https://opendev.org/openstack/nova/src/commit/1d788f11890385658f9485a281c1beeede94a830/nova/pci/stats.py#L874

There are cases where request_id has never been populated for various instances with pci
devices:

MariaDB [nova]> select instance_uuid, request_id from pci_devices;
+--------------------------------------+--------------------------------------+
| instance_uuid                        | request_id                           |
+--------------------------------------+--------------------------------------+
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| deeafa5e-86a4-4e4e-9172-574d0a3629fc | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| 80967831-104b-4619-9415-f819e458b307 | NULL                                 |
| d9701926-2e83-4cab-9e37-54ffa0309a22 | NULL                                 |
| af160e3d-a4aa-418f-b9ff-eaa20ec1d947 | c277bea1-5c4c-40d1-812f-f8c680689214 |
| 1dbf0831-e4a6-4073-b501-ce9d9d598937 | ed6eab10-b0e6-48b8-be60-dad8c0553c8b |
+--------------------------------------+--------------------------------------+

The following queries show that the request_id is either missing or set to null for an affected instance:

[nova] select pci_requests from instance_extra where instance_uuid='<INSTANCE_UUID>' \G;
[nova_api] select spec from request_specs where instance_uuid='<INSTANCE_UUID>' \G;

Freshly spawned instances do not suffer from a missing PCI request_id. Some of the problematic instances are old, spawned during the Train release. Instances spawned during the Zed release have request_id set and are able to migrate.

We are able to work around this issue by adding a newly generated request_id to the corresponding tables.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: migration

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2084238

Title:
  Cold-Migration fails when pci_request has nulled request_id

Status in OpenStack Compute (nova):
  New
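
The failure mode and the reported workaround can be illustrated in isolation. The following is a minimal sketch, not nova's code: only the `group_id.startswith(request.request_id)` condition comes from the traceback. The function names, the `"<request_id>-0"` group id form, the provider names, and the dict layout of a PCI request record are hypothetical stand-ins chosen for illustration.

```python
import uuid


def rp_uuids_for_request(provider_mapping, request_id):
    """Simplified stand-in for the prefix match seen in the traceback.

    Keeps the first provider of every group whose id starts with the
    request's id. Raises TypeError when request_id is None, because
    str.startswith() only accepts a str or a tuple of str.
    """
    return [
        rp_uuids[0]
        for group_id, rp_uuids in provider_mapping.items()
        if group_id.startswith(request_id)
    ]


def backfill_request_ids(pci_requests):
    """Return copies of the request dicts with a freshly generated UUID
    wherever request_id is missing or None (hypothetical record layout).
    Existing request_id values are left untouched.
    """
    return [
        {**req, "request_id": req.get("request_id") or str(uuid.uuid4())}
        for req in pci_requests
    ]


# Hypothetical mapping of request group ids to resource provider names.
mapping = {"c277bea1-5c4c-40d1-812f-f8c680689214-0": ["rp-gpu-1"]}

# An old instance whose PCI request never received a request_id.
legacy_requests = [{"alias_name": "t4", "request_id": None}]

try:
    rp_uuids_for_request(mapping, legacy_requests[0]["request_id"])
except TypeError as exc:
    print(f"reproduced: {exc}")

# After backfilling, the same call no longer raises; it simply finds no
# matching group for the newly generated id.
repaired = backfill_request_ids(legacy_requests)
print(rp_uuids_for_request(mapping, repaired[0]["request_id"]))
```

This only shows why a NULL request_id breaks the prefix match and why assigning a generated id avoids the exception. Which tables and serialized fields have to be updated consistently on a real deployment is outside the scope of the sketch.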

