Public bug reported:

Maybe "race" isn't the right word, but Instance.save() persists the
ERROR state before the instance.fault record is created, so there is a
window in which the instance visibly shows ERROR state but the
instance.fault does not exist yet.

Seen in the gate:

Symptom: test_qos_min_bw_allocation_basic fails because the expected
'fault' field is missing from the server returned by
GET /servers/{server_id}.

testr_results.html:

testtools.matchers._impl.MismatchError: 'fault' not in {'id':
'19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name':
'tempest-MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR',
'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id':
'8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '',
'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links':
[{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]},
'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap': 0,
'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed': 'True'}},
'created': '2023-12-07T15:00:24Z', 'updated': '2023-12-07T15:00:30Z',
'addresses': {}, 'accessIPv4': '', 'accessIPv6': '', 'links': [{'rel':
'self', 'href':
'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'},
{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}],
'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '',
'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None,
'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None,
'OS-EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0,
'os-extended-volumes:volumes_attached': [], 'locked': False, 'description': None,
'tags': [], 'trusted_image_certificates': None, 'server_groups': []}

screen-placement-api.txt:

found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
this ^ is expected for this part of the test

OpenSearch query:

message:"testtools.matchers._impl.MismatchError: 'fault' not in"

Comments:

This may be a race because ERROR state is set on the instance and saved
before the 'fault' record is created:
https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
The test waits for ERROR state before checking for the 'fault' field, so
it may sometimes GET the instance before the fault has been added:
https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276
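
The test-side sequence described above can be sketched roughly as
follows. This is only an illustration, not the tempest code: get_server
is a hypothetical callable standing in for GET /servers/{server_id},
wait_for_fault is a hypothetical helper, and the fake API below exists
only to exercise it. It shows how a caller could tolerate a short window
between ERROR state and the fault record appearing.

```python
import time


def wait_for_fault(get_server, timeout=5.0, interval=0.1):
    """Poll until the server body contains 'fault' or the timeout expires.

    get_server is a hypothetical zero-argument callable returning the
    server dict, standing in for GET /servers/{server_id}.
    """
    deadline = time.monotonic() + timeout
    while True:
        server = get_server()
        if 'fault' in server:
            return server
        if time.monotonic() >= deadline:
            raise TimeoutError("server never reported a 'fault' field")
        time.sleep(interval)


def make_fake_get_server(polls_without_fault):
    """Fake API: returns ERROR without 'fault' for the first N polls."""
    calls = {'count': 0}

    def get_server():
        calls['count'] += 1
        body = {'status': 'ERROR'}
        if calls['count'] > polls_without_fault:
            # Made-up fault payload, for illustration only.
            body['fault'] = {'message': 'example fault'}
        return body

    return get_server
```

For example, wait_for_fault(make_fake_get_server(2), interval=0.01)
returns on the third poll, once the fake starts including 'fault'.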

Code:

    vm_state = updates['vm_state']
    properties = request_spec.get('instance_properties', {})
    notifier = rpc.get_notifier(service)
    state = vm_state.upper()
    LOG.warning('Setting instance to %s state.', state,
                instance_uuid=instance_uuid)

    instance = objects.Instance(context=context, uuid=instance_uuid,
                                **updates)
    instance.obj_reset_changes(['uuid'])
    instance.save()
    compute_utils.add_instance_fault_from_exc(
        context, instance, ex, sys.exc_info())


I wonder if it would be legitimate to swap the ordering and call
compute_utils.add_instance_fault_from_exc() before the instance.save() of
ERROR state?
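
To make the question concrete, here is a toy model of the two orderings.
It is not nova code: FakeDB, set_error_then_fault, and
add_fault_then_set_error are hypothetical names, and the observe hook
stands in for an API reader that happens to GET the instance between the
two writes. Whether the swap is actually safe in nova (for example,
whether add_instance_fault_from_exc() relies on the instance having been
saved first) is not something this sketch answers.

```python
class FakeDB:
    """Toy stand-in for the instance row and its fault record."""

    def __init__(self):
        self.vm_state = 'building'
        self.fault = None

    def api_view(self):
        # Simplification: 'fault' is shown whenever a record exists.
        view = {'status': self.vm_state.upper()}
        if self.fault is not None:
            view['fault'] = self.fault
        return view


def set_error_then_fault(db, observe):
    """Current ordering: save ERROR state first, then create the fault."""
    db.vm_state = 'error'
    observe(db.api_view())  # a reader polling in the window between writes
    db.fault = {'message': 'example fault'}


def add_fault_then_set_error(db, observe):
    """Proposed ordering: create the fault first, then save ERROR state."""
    db.fault = {'message': 'example fault'}
    observe(db.api_view())  # same reader, same point between the writes
    db.vm_state = 'error'


def error_views_without_fault(ordering):
    """Return the views seen mid-update that show ERROR but no 'fault'."""
    seen = []
    ordering(FakeDB(), seen.append)
    return [v for v in seen if v['status'] == 'ERROR' and 'fault' not in v]
```

In this model the current ordering lets the mid-update reader see ERROR
without a fault, while the swapped ordering does not: by the time the
status reads ERROR, the fault record already exists.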

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: gate-failure


https://bugs.launchpad.net/bugs/2045921

Title:
  scheduler.utils.set_vm_state_and_notify() race between ERROR state
  save() and compute_utils.add_instance_fault_from_exc()

Status in OpenStack Compute (nova):
  New

