Public bug reported:
We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri (nova
21.2.5.dev27). The cell0 control plane is upgraded and the cell controllers are
all on the same nova version.
We left only the nova-compute nodes running the prior version so that we could
upgrade them cell by cell.
Since then, nova-conductor has been reporting errors like the following (as
logged by nova-compute):
ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node sc9-1-hv329: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: Traceback (most recent call last):
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1333, in get_by_host_and_node
    expected_attrs)
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1238, in _make_instance_list
    expected_attrs=expected_attrs)
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 441, in _from_db_object
    expected_attrs)
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 502, in _extra_attributes_from_db_object
    db_inst['extra'].get('resources'))
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1025, in _load_resources
    jsonutils.loads(db_resources))
  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py", line 249, in loads
    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This error now prevents nova-compute from starting instances once they have
been stopped.
So far we have tracked it down to corruption of the nova.instance_extra table
at the individual cell level, visible when comparing a row before and after the
upgrade.
The corruption seems to start in the keypairs column and continue through the
columns that follow it, which suggests a shift in a Python class/structure.
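One way to spot affected rows is to check whether each serialized column still parses as JSON. The following is only a minimal sketch using the standard library: it assumes each listed column holds either NULL or a JSON document (as in the pre-upgrade row), and `find_bad_columns` and the `row` dict are hypothetical names, not part of nova. The sample `resources` text is copied from the corrupted row.

```python
import json

# Serialized columns of nova.instance_extra, in the order shown by the query.
JSON_COLUMNS = [
    "numa_topology", "pci_requests", "flavor", "vcpu_model",
    "migration_context", "keypairs", "device_metadata",
    "trusted_certs", "vpmems", "resources",
]


def find_bad_columns(row):
    """Return {column: error message} for columns that do not parse as JSON.

    `row` is a hypothetical dict mapping column name to the raw value;
    None stands for SQL NULL and is skipped.
    """
    bad = {}
    for column in JSON_COLUMNS:
        value = row.get(column)
        if value is None:
            continue
        try:
            json.loads(value)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            bad[column] = str(exc)
    return bad


# Shortened sample values; "resources" is taken from the corrupted row.
row = {
    "pci_requests": "[]",
    "trusted_certs": None,
    "resources": 'nceNUMATopology", "nova_object.namespace": "nova",',
}
print(find_bad_columns(row))
```

For the sample above this reports the same "Expecting value: line 1 column 1 (char 0)" message as the traceback, because the stored text does not begin with a valid JSON token.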
Pre upgrade:
MariaDB [(none)]> select * from nova.instance_extra where instance_uuid =
'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
created_at: 2023-06-02 20:42:11
updated_at: 2023-06-02 20:43:48
deleted_at: NULL
deleted: 0
id: 260958
instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
numa_topology: {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null,
"cpu_topology": {"nova_object.name": "VirtCPUTopology",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"sockets": 1, "cores": 1, "threads": 1},
"nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw":
{"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer",
"cpuset_reserved": null}, "nova_object.changes": ["cpu_topology",
"cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]}
pci_requests: []
flavor: {"cur": {"nova_object.name": "Flavor",
"nova_object.namespace": "nova", "nova_object.version": "1.2",
"nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512,
"vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid":
"d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0,
"vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs":
{"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2",
"hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy":
"prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z",
"updated_at": null, "deleted_at": null, "deleted": false},
"nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
vcpu_model: {"nova_object.name": "VirtCPUModel",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"arch": null, "vendor": null, "topology":
{"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova",
"nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1,
"threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]},
"features": [], "mode": "host-passthrough", "model": null, "match": "exact"},
"nova_object.changes": ["features", "vendor", "topology", "model", "mode",
"match", "arch"]}
migration_context: NULL
keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"objects":
[{"nova_object.name": "KeyPair", "nova_object.namespace": "nova",
"nova_object.version": "1.4", "nova_object.data": {"id": 10, "name":
"rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint":
"c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7
root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z",
"updated_at": null, "deleted_at": null, "deleted": false}}]}}
device_metadata: NULL
trusted_certs: NULL
vpmems: NULL
resources: NULL
Post Cell0 and cell controller upgrade:
After an instance is stopped, its instance_extra row is corrupted (the keypairs
column and those following it), and the instance can no longer be started
unless the row is restored to its previous state.
This is after the cell controller upgrade, with nova-compute still running
Train; restarting the service doesn't change the situation.
MariaDB [(none)]> select * from nova.instance_extra where instance_uuid =
'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
created_at: 2023-06-02 20:42:11
updated_at: 2023-06-05 17:19:51
deleted_at: NULL
deleted: 0
id: 260958
instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
numa_topology: {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null,
"cpu_topology": {"nova_object.name": "VirtCPUTopology",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"sockets": 1, "cores": 1, "threads": 1},
"nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw":
{"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer",
"cpuset_reserved": null}, "nova_object.changes": ["cpu_topology",
"cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]}
pci_requests: []
flavor: {"cur": {"nova_object.name": "Flavor",
"nova_object.namespace": "nova", "nova_object.version": "1.2",
"nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512,
"vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid":
"d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0,
"vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs":
{"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2",
"hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy":
"prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z",
"updated_at": null, "deleted_at": null, "deleted": false},
"nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
vcpu_model: {"nova_object.name": "VirtCPUModel",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"arch": null, "vendor": null, "topology":
{"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova",
"nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1,
"threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]},
"features": [], "mode": "host-passthrough", "model": null, "match": "exact"},
"nova_object.changes": ["features", "vendor", "topology", "model", "mode",
"match", "arch"]}
migration_context: NULL
keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"objects":
[{"nova_object.name": "KeyPair", "nova_object.namespace": "nova",
"nova_object.version": "1.4", "nova_object.data": {"id": 10, "name":
"rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint":
"c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7
root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z",
"updated_at": null, "deleted_at": null, "deleted": false}}]¤
device_metadata: NULL
trusted_certs: NULL
vpmems: +b$=
Wûd °EJ°Jô¨
c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
resources: nceNUMATopology", "nova_object.namespace": "nova",
"nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
At this point we are accelerating the nova-compute upgrades to see if that
fixes it.
If it does, then N-1 compatibility is not working for a cellsv2 deployment.
So far we haven't found the issue in the code and would appreciate feedback on
where to look.
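One detail in the post-upgrade row may help narrow this down: the text at the end of vpmems and the text in resources, joined together, reproduce the beginning of the same row's numa_topology value. That would be consistent with raw bytes spilling across column boundaries rather than with a changed object schema, but it is only a reading of the dump, not a confirmed cause. A small sketch of the comparison, with the fragments copied from the output above (wrapped lines are joined with a single space, which is an assumption about the unwrapped values):

```python
# Fragments copied from the corrupted row in the post-upgrade query output.
vpmems_tail = '{"nova_object.name": "Insta'
resources = ('nceNUMATopology", "nova_object.namespace": "nova", '
             '"nova_object.version": "1.3", '
             '"nova_object.data": {"cells": [{"nov')

# Beginning of the numa_topology column of the same row.
numa_topology_head = ('{"nova_object.name": "InstanceNUMATopology", '
                      '"nova_object.namespace": "nova", '
                      '"nova_object.version": "1.3", '
                      '"nova_object.data": {"cells": [{"nova_object.name": '
                      '"InstanceNUMACell"')

# If the two fragments are one value split across columns, stitching them
# back together should give a prefix of numa_topology.
stitched = vpmems_tail + resources
print(numa_topology_head.startswith(stitched))
```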
** Affects: nova
Importance: Undecided
Status: New
** Description changed:
We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri
(nova 21.2.5.dev27) where the cell0 control plane is upgraded
and the cell controllers are all on the same nova version.
We only left the nova-compute nodes running at the prior version to do a
upgrade cell by cell.
But now we realized we got the nova-conductor reporting errors like
- ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - -
-] Error updating resources for node
us01odc-sc9-1-hv329.us01-odc.synopsys.com.:
oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting
value: line 1 column 1 (char 0)
+ ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - -
-] Error updating resources for node sc9-1-hv329:
oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting
value: line 1 column 1 (char 0)
Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: ['Traceback (most recent
call last):\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py",
line 139, in _object_dispatch\n return getattr(target, method)(*args,
**kwargs)\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py",
line 184, in wrapper\n result = fn(cls, context, *args, **kwargs)\n', '
File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py",
line 1333, in get_by_host_and_node\n expected_attrs)\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py",
line 1238, in _make_instance_list\n expected_attrs=expected_attrs)\n', '
File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py",
line 441, in _from_db_object\n expected_attrs)\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py",
line 502, in _extra_attributes_from_db_object\n
db_inst[\'extra\'].get(\'resources\'))\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py",
line 1025, in _load_resources\n jsonutils.loads(db_resources))\n', ' File
"/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py",
line 249, in loads\n return json.loads(encodeutils.safe_decode(s,
encoding), **kwargs)\n', ' File "/usr/lib/python3.6/json/__init__.py", line
354, in loads\n return _default_decoder.decode(s)\n', ' File
"/usr/lib/python3.6/json/decoder.py", line 339, in decode\n obj, end =
self.raw_decode(s, idx=_w(s, 0).end())\n', ' File
"/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode\n raise
JSONDecodeError("Expecting value", s, err.value) from None\n',
'json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n'].
-
This error now prevents nova-compute from starting instances once they have
been stopped.
So far we have tracked it down to corruption of the nova.instance_extra table
at the individual cell level, found by comparing rows pre vs. post upgrade.
The corruption seems to start within the keypairs column and affect the
columns that follow it, which suggests a shift in a Python class/structure.
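One way to look for other affected rows is to check that every serialized
column is either NULL or parses as JSON. The following is a minimal sketch,
not nova code: the column list is taken from the query output in this report,
find_corrupt_columns is a hypothetical helper, and it assumes each row has
already been fetched as a dict by whatever DB client is at hand.

```python
import json

# Serialized columns as they appear in the instance_extra output in this report.
JSON_COLUMNS = (
    "numa_topology", "pci_requests", "flavor", "vcpu_model",
    "migration_context", "keypairs", "device_metadata",
    "trusted_certs", "vpmems", "resources",
)


def find_corrupt_columns(row):
    """Return the names of columns whose value is neither NULL nor valid JSON."""
    bad = []
    for name in JSON_COLUMNS:
        value = row.get(name)
        if value is None:
            continue  # NULL is a legitimate value for these columns
        if isinstance(value, bytes):
            # Binary debris may not be valid UTF-8; decode leniently.
            value = value.decode("utf-8", errors="replace")
        try:
            json.loads(value)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            bad.append(name)
    return bad
```

Run against a row like the post-upgrade one in this report, this would be
expected to flag keypairs, vpmems and resources, while the pre-upgrade row
should come back clean.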
Pre Upgrade
-
MariaDB [(none)]> select * from nova.instance_extra where instance_uuid =
'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
created_at: 2023-06-02 20:42:11
updated_at: 2023-06-02 20:43:48
deleted_at: NULL
deleted: 0
id: 260958
instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
numa_topology: {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null,
"cpu_topology": {"nova_object.name": "VirtCPUTopology",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"sockets": 1, "cores": 1, "threads": 1},
"nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw":
{"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer",
"cpuset_reserved": null}, "nova_object.changes": ["cpu_topology",
"cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]}
pci_requests: []
flavor: {"cur": {"nova_object.name": "Flavor",
"nova_object.namespace": "nova", "nova_object.version": "1.2",
"nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512,
"vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid":
"d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0,
"vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs":
{"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2",
"hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy":
"prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z",
"updated_at": null, "deleted_at": null, "deleted": false},
"nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
vcpu_model: {"nova_object.name": "VirtCPUModel",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"arch": null, "vendor": null, "topology":
{"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova",
"nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1,
"threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]},
"features": [], "mode": "host-passthrough", "model": null, "match": "exact"},
"nova_object.changes": ["features", "vendor", "topology", "model", "mode",
"match", "arch"]}
migration_context: NULL
keypairs: {"nova_object.name": "KeyPairList",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"objects": [{"nova_object.name": "KeyPair",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 10, "name": "rpc_support", "user_id":
"5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint":
"c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7
root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z",
"updated_at": null, "deleted_at": null, "deleted": false}}]}}
device_metadata: NULL
trusted_certs: NULL
vpmems: NULL
resources: NULL
-
Post cell0 and cell controller upgrade:
After a stop, the instance_extra row is corrupted (the keypairs column and
those following it), and the instance can no longer be started unless the row
is restored to its previous state.
This is after the cell controller upgrade, with nova-compute still running
Train; restarting the service doesn't change the situation.
-
MariaDB [(none)]> select * from nova.instance_extra where instance_uuid =
'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
created_at: 2023-06-02 20:42:11
updated_at: 2023-06-05 17:19:51
deleted_at: NULL
deleted: 0
id: 260958
instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
numa_topology: {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null,
"cpu_topology": {"nova_object.name": "VirtCPUTopology",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"sockets": 1, "cores": 1, "threads": 1},
"nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw":
{"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer",
"cpuset_reserved": null}, "nova_object.changes": ["cpu_topology",
"cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]}
pci_requests: []
flavor: {"cur": {"nova_object.name": "Flavor",
"nova_object.namespace": "nova", "nova_object.version": "1.2",
"nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512,
"vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid":
"d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0,
"vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs":
{"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2",
"hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy":
"prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z",
"updated_at": null, "deleted_at": null, "deleted": false},
"nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
vcpu_model: {"nova_object.name": "VirtCPUModel",
"nova_object.namespace": "nova", "nova_object.version": "1.0",
"nova_object.data": {"arch": null, "vendor": null, "topology":
{"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova",
"nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1,
"threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]},
"features": [], "mode": "host-passthrough", "model": null, "match": "exact"},
"nova_object.changes": ["features", "vendor", "topology", "model", "mode",
"match", "arch"]}
migration_context: NULL
keypairs: {"nova_object.name": "KeyPairList",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"objects": [{"nova_object.name": "KeyPair",
"nova_object.namespace": "nova", "nova_object.version": "1.4",
"nova_object.data": {"id": 10, "name": "rpc_support", "user_id":
"5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint":
"c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7
root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z",
"updated_at": null, "deleted_at": null, "deleted": false}}]¤
device_metadata: NULL
trusted_certs: NULL
vpmems: +b$=
Wûd °EJ°Jô¨
c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
resources: nceNUMATopology", "nova_object.namespace": "nova",
"nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
-
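The traceback is consistent with the shifted data: the post-upgrade resources
column begins with the tail of a serialized InstanceNUMATopology object, which
is not a JSON value. A small sketch using the standard json module (which
jsonutils.loads calls, per the traceback), with the start of the corrupted
resources value from the row above as input:

```python
import json

# Start of the corrupted `resources` value copied from the post-upgrade row.
corrupt_resources = 'nceNUMATopology", "nova_object.namespace": "nova",'

try:
    json.loads(corrupt_resources)
    message = None
except json.JSONDecodeError as exc:
    message = str(exc)

print(message)
```

The message is the same "Expecting value: line 1 column 1 (char 0)" that
nova-conductor reports from _load_resources.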
At this point we are accelerating the nova-compute upgrades to see if that
fixes it.
If it does, then N-1 is not working for a cellsv2 deployment.
So far we haven't found the issue in the code and would appreciate feedback
on where to look.
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2022967
Title:
instance_extra corrupts on N-1 cells upgrade
Status in OpenStack Compute (nova):
New