[Yahoo-eng-team] [Bug 1825386] [NEW] nova is looking for OVMF file no longer provided by latest CentOS
Public bug reported:

In nova/virt/libvirt/driver.py the code looks for a hardcoded path
"/usr/share/OVMF/OVMF_CODE.fd". It appears that CentOS 7.6 has modified
the OVMF-20180508-3 rpm to no longer contain this file. Instead it now
seems to be named /usr/share/OVMF/OVMF_CODE.secboot.fd

This will break the ability to boot guests using UEFI.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1825386
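A minimal sketch of one possible fix for the above, assuming we simply
probe a list of candidate firmware paths instead of hardcoding a single
one (the names below are illustrative, not nova's actual ones):

    import os

    # Candidate OVMF firmware images, in order of preference. The
    # secboot variant is what CentOS 7.6 ships in OVMF-20180508-3.
    _UEFI_LOADER_CANDIDATES = [
        "/usr/share/OVMF/OVMF_CODE.fd",
        "/usr/share/OVMF/OVMF_CODE.secboot.fd",
    ]

    def get_uefi_loader_path():
        """Return the first UEFI firmware image present on this host."""
        for path in _UEFI_LOADER_CANDIDATES:
            if os.path.exists(path):
                return path
        return None  # caller can then fail with a UEFI-not-supported error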
[Yahoo-eng-team] [Bug 1819216] [NEW] in devstack, "nova migrate <instance>" will try to migrate to the same host (and then fail)
Public bug reported:

In multinode devstack I had an instance running on one node and tried
running "nova migrate <instance>". The operation started, but then the
instance went into an error state with the following fault:

{"message": "Unable to migrate instance (2bbdab8e-3a83-43a4-8c47-ce57b653e43e) to current host (fedora-1.novalocal).", "code": 400, "created": "2019-03-08T19:59:09Z"}

Logically, I think that even if "resize to same host" is enabled, for a
"migrate" operation we should remove the current host from
consideration. We know it's going to fail, and it doesn't make sense
anyway.

Also, it would probably make sense to make "migrate" work like "live
migration", which removes the current host from consideration; a sketch
of that follows below.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

** Summary changed:

- in devstack, "nova migrate <instance>" can try to migrate to the same host
+ in devstack, "nova migrate <instance>" will try to migrate to the same host (and then fail)

https://bugs.launchpad.net/bugs/1819216
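A sketch of the kind of change intended above, assuming it lands in the
conductor's cold-migrate task (the helper name is made up; live
migration already does something equivalent):

    def _exclude_source_host(request_spec, instance):
        # Even when allow_resize_to_same_host is enabled, an explicit
        # "migrate" should never consider the host the instance is
        # already on -- we know the operation would fail there.
        if request_spec.ignore_hosts is None:
            request_spec.ignore_hosts = []
        if instance.host not in request_spec.ignore_hosts:
            request_spec.ignore_hosts.append(instance.host)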
[Yahoo-eng-team] [Bug 1818701] [NEW] invalid PCI alias in flavor results in HTTP 500 on instance create
Public bug reported:

If an invalid PCI alias is specified in the flavor extra-specs and we
try to create an instance with that flavor, it will result in a
PciInvalidAlias exception being raised. In ServersController.create()
PciInvalidAlias is missing from the list of exceptions that get
converted to an HTTPBadRequest. Instead, it's reported as a 500 error:

[stack@fedora-1 nova]$ nova boot --flavor ds2G --image fedora29 --nic none --admin-pass fedora asdf3
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. (HTTP 500) (Request-ID: req-fec3face-4135-41fd-bc48-07957363ddae)

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: api

https://bugs.launchpad.net/bugs/1818701
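The likely fix is a one-line addition to the long except clause in
ServersController.create(). A self-contained sketch of the intended
mapping (the helper is illustrative, not the actual diff):

    from webob import exc

    from nova import exception

    def _translate_create_error(error):
        """Sketch: map bad-input failures from compute_api.create()
        to HTTP 400 instead of letting them escape as a 500."""
        bad_request_exceptions = (
            exception.PciRequestAliasNotDefined,  # already handled today
            exception.PciInvalidAlias,            # the missing entry
        )
        if isinstance(error, bad_request_exceptions):
            # An invalid alias is bad user input, not a server fault.
            raise exc.HTTPBadRequest(explanation=error.format_message())
        raise error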
[Yahoo-eng-team] [Bug 1818092] [NEW] hypervisor check in _check_instance_has_no_numa() is broken
Public bug reported:

In commit ae2e5650d "Fail to live migration if instance has a NUMA
topology" there is a check against hypervisor_type. Unfortunately it
tests against the value "obj_fields.HVType.KVM". Even when KVM is
supported by qemu, the libvirt driver will still report the hypervisor
type as "QEMU". So we need to fix up the hypervisor type check,
otherwise we'll always fail the check.

** Affects: nova
     Importance: Undecided
       Assignee: Chris Friesen (cbf123)
         Status: In Progress

** Tags: compute

https://bugs.launchpad.net/bugs/1818092
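The check presumably needs to accept both reported types, along these
lines (a sketch, not the merged fix):

    from nova.objects import fields as obj_fields

    def _is_kvm_or_qemu(hypervisor_type):
        # The libvirt driver reports "QEMU" as the hypervisor type even
        # when KVM acceleration is in use, so testing only HVType.KVM
        # always fails on such hosts.
        return hypervisor_type.lower() in (obj_fields.HVType.KVM,
                                           obj_fields.HVType.QEMU)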
[Yahoo-eng-team] [Bug 1815949] [NEW] missing special-case libvirt exception during device detach
Public bug reported:

In Pike a customer has run into the following issue:

2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall._func' failed: libvirtError: internal error: unable to execute QEMU command 'device_del': Device 'virtio-disk15' not found
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall Traceback (most recent call last):
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 143, in _run_loop
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 363, in _func
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     result = f(*args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 505, in _do_wait_and_retry_detach
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     _try_detach_device(config, persistent=False, host=host)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 467, in _try_detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     device=alternative_device_name)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     self.force_reraise()
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     six.reraise(self.type_, self.value, self.tb)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 451, in _try_detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     self.detach_device(conf, persistent=persistent, live=live)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 530, in detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     self._domain.detachDeviceFlags(device_xml, flags=flags)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     result = proxy_call(self._autowrap, f, *args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     rv = execute(f, *args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     six.reraise(c, e, tb)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     rv = meth(*args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1217, in detachDeviceFlags
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall     if ret == -1: raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall libvirtError: internal error: unable to execute QEMU command 'device_del': Device 'virtio-disk15' not found

Based on discussion with Melanie Witt, it seems likely that nova is
missing a special case in Guest.detach_device_with_retry(). It seems
likely we need to modify the conditional at line 409 of
virt/libvirt/guest.py to look like:

    if errcode in (libvirt.VIR_ERR_OPERATION_FAILED,
                   libvirt.VIR_ERR_INTERNAL_ERROR):

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1815949
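Fleshed out slightly (a sketch under the assumption that the "not
found" message text needs matching too, since VIR_ERR_INTERNAL_ERROR
covers many unrelated failures):

    import libvirt

    from nova import exception

    def _raise_if_device_already_gone(ex, device_name):
        """Translate libvirt's 'device not found' into DeviceNotFound.

        Newer qemu reports a failed device_del as an internal error
        rather than VIR_ERR_OPERATION_FAILED, so both codes must be
        checked before inspecting the message text.
        """
        errcode = ex.get_error_code()
        if errcode in (libvirt.VIR_ERR_OPERATION_FAILED,
                       libvirt.VIR_ERR_INTERNAL_ERROR):
            if 'not found' in (ex.get_error_message() or ''):
                raise exception.DeviceNotFound(device=device_name)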
[Yahoo-eng-team] [Bug 1792985] [NEW] strict NUMA memory allocation for 4K pages leads to OOM-killer
Public bug reported:

We've seen a case on a resource-constrained compute node where booting
multiple instances passed, but led to the following error messages from
the host kernel:

[ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or sacrifice child
[ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB

The problem appears to be that currently with libvirt an instance which
does not specify a NUMA topology (which implies "shared" CPUs and the
default memory pagesize) is allowed to float across the whole compute
node. As such, we do not know which host NUMA node its memory is going
to be allocated from, and therefore we don't know how much memory is
remaining on each host NUMA node.

If we have a similar instance which *is* limited to a particular NUMA
node (due to adding a PCI device for example, or in the future by
specifying dedicated CPUs) then that allocation will currently use
"strict" NUMA affinity. This allocation can fail if there isn't enough
memory available on that NUMA node (due to being "stolen" by a floating
instance, for example).

I think this means that we cannot use "strict" affinity for the default
page size even when we do have a numa_topology, since we can't have
accurate per-NUMA-node accounting due to the fact that we don't know
which NUMA node floating instances allocated their memory from.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1792985
[Yahoo-eng-team] [Bug 1792077] [NEW] problem specifying multiple "bus=scsi" block devices on nova boot
Public bug reported:

I'm using devstack stable/rocky on ubuntu 16.04.

When running this command

nova boot --flavor m1.small --nic net-name=public --block-device source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove,bus=scsi --block-device source=blank,dest=volume,size=2,bootindex=1,shutdown=remove,bus=scsi --poll twovol

the instance fails to boot with the error:

libvirtError: unsupported configuration: Found duplicate drive address for disk with target name 'sda' controller='0' bus='0' target='0' unit='0'

For some background information, this works:

nova boot --flavor m1.small --nic net-name=public --block-device source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove,bus=scsi --poll onevol

It also works if I have two block devices but don't specify "bus=scsi":

nova boot --flavor m1.small --nic net-name=public --block-device source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove --block-device source=blank,dest=volume,size=2,bootindex=1,shutdown=remove --poll twovolnoscsi

This maps to domain XML logged by nova-compute at "Sep 12 05:05:22".
(The XML markup itself was stripped in the mail archive; all that
survives of it are the two volume serials
f16cb93d-7bf0-4da7-a804-b9539d64576a and
7d5de2b0-cb66-4607-a5f5-60fd40db51c3.)

In the failure case, the nova-compute logs include the following
interesting bits. The original report points out additional lines in
the XML, presumably the disk <address> elements given the error; that
XML was likewise stripped, leaving only the volume serials
08561cc0-5cf2-4eb7-a3f9-956f945e6c24 and
007fac3d-8800-4f45-9531-e3bab5c86a1e and these error lines:

Sep 12 04:48:43 devstack nova-compute[3062]: ERROR nova.virt.libvirt.guest [None req-a7c5f15c-1e44-4cd1-bf57-45b819676b20 admin admin] Error defining a guest with XML: libvirtError: unsupported configuration: Found duplicate drive address for disk with target name 'sda' controller='0' bus='0' target='0' unit='0'
Sep 12 04:48:43 devstack nova-compute[3062]: ERROR nova.virt.libvirt.driver [None req-a7c5f15c-1e44-4cd1-bf57-45b819676b20 admin admin] [instance: cf4f2c6f-7391-4a49-8f40-5e5cda98f78b] Failed to start libvirt guest: libvirtError: unsupported configuration: Found duplicate drive address for disk with target name 'sda' controller='0' bus='0' target='0' unit='0'

Here is the libvirtd log in the failure case:

2018-09-12 04:48:43.312+0000: 16889: error : virDomainDefCheckDuplicateDriveAddresses:5747 : unsupported configuration: Found duplicate drive address for disk with target name 'sda' controller='0' bus='0' target='0' unit='0'

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1792077
[Yahoo-eng-team] [Bug 1790195] [NEW] performance problems starting up nova process due to regex code
Public bug reported:

We noticed that nova process startup seems to take a long time. It
looks like one major culprit is the regex code at
https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py

Sean K Mooney highlighted one possible culprit:

  i dont really like this
  https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py#L128-L142

      def _get_all_chars():
          for i in range(0xFFFF):
              yield six.unichr(i)

  so that's got to loop 65535 times, and we call the function 17 times,
  so that's ~1.1 million calls to re.escape every time we load that
  module

** Affects: nova
     Importance: Undecided
       Assignee: sean mooney (sean-k-mooney)
         Status: New

https://bugs.launchpad.net/bugs/1790195
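A minimal sketch of one way to shave part of that cost, assuming the
character set never changes at runtime: materialize the characters once
at import and share them across the 17 call sites (the per-site
re.escape results could be cached similarly):

    import six

    # Materialized a single time; previously every call site
    # re-generated all 65535 characters via the generator.
    _ALL_CHARS = tuple(six.unichr(i) for i in range(0xFFFF))

    def _get_all_chars():
        return _ALL_CHARS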
[Yahoo-eng-team] [Bug 1785270] [NEW] allow confirmation of resize/migration for migrations in "confirming" status
Public bug reported:

Confirmation of a resize is an RPC operation. If a compute node fails
after a migration has been put into the "confirming" status there is no
way to confirm it again, causing the state of the instance to get
"stuck".

In the case of confirm_resize(), I don't see any problem with allowing
us to retry by sending another confirm_resize message. On the target
compute node the actual confirmation is synchronized by instance.uuid,
so there should be no races, and it already handles the "migration is
already confirmed" case.

The proposed code change would look something like this:

     @check_instance_state(vm_state=[vm_states.RESIZED])
     def confirm_resize(self, context, instance, migration=None):
         """Confirms a migration/resize and deletes the 'old' instance."""
         elevated = context.elevated()
         # NOTE(melwitt): We're not checking quota here because there isn't a
         # change in resource usage when confirming a resize. Resource
         # consumption for resizes are written to the database by compute, so
         # a confirm resize is just a clean up of the migration objects and a
         # state change in compute.
         if migration is None:
-            migration = objects.Migration.get_by_instance_and_status(
-                elevated, instance.uuid, 'finished')
+            # Look for migrations in confirming state as well as finished to
+            # handle cases where the confirm did not complete (eg. because
+            # the compute node went away during the confirm).
+            for status in ('finished', 'confirming'):
+                try:
+                    migration = objects.Migration.get_by_instance_and_status(
+                        elevated, instance.uuid, status)
+                    break
+                except exception.MigrationNotFoundByStatus:
+                    pass
+
+            if migration is None:
+                raise exception.MigrationNotFoundByStatus(
+                    instance_id=instance.uuid, status='finished|confirming')

** Affects: nova
     Importance: Low
         Status: Triaged

** Tags: resize

https://bugs.launchpad.net/bugs/1785270
[Yahoo-eng-team] [Bug 1785123] [NEW] UEFI NVRAM lost on cold migration or resize
Public bug reported:

If you boot a virtual instance with UEFI, the UEFI NVRAM is lost on a
cold migration. The default storage for the virtual UEFI NVRAM is in
/var/lib/libvirt/qemu/nvram/, and the file is not being copied over on
cold migration.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1785123
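A rough sketch of what the cold-migration path would need to do in
addition to moving the instance disks (the helper and the local copy
are assumptions for illustration; nova actually transfers files between
hosts over ssh/rsync, and libvirt's per-domain varstore naming is the
usual <domain>_VARS.fd convention):

    import os
    import shutil

    NVRAM_DIR = '/var/lib/libvirt/qemu/nvram'

    def copy_uefi_nvram(domain_name, dest_dir):
        # libvirt keeps a per-domain copy of the UEFI varstore; if it
        # is left behind, the migrated guest boots with blank NVRAM.
        src = os.path.join(NVRAM_DIR, '%s_VARS.fd' % domain_name)
        if os.path.exists(src):
            shutil.copy2(src, dest_dir)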
[Yahoo-eng-team] [Bug 1785086] [NEW] docs for RPC are out of date
Public bug reported:

The information in doc/source/reference/rpc.rst is stale and should
probably be updated or removed so that it doesn't confuse people.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: docs

https://bugs.launchpad.net/bugs/1785086
[Yahoo-eng-team] [Bug 1781643] [NEW] With remote storage, swap disk size changed after resize-revert
Public bug reported:

There seems to be an issue (discovered in Pike) where ceph-backed swap
does not return to the original size if a resize operation is reverted.

Steps to reproduce:
1) Configure compute nodes to use remote ceph-backed storage for instances.
2) Launch a vm with ephemeral and swap disk. (The swap disk will be RBD-backed.)
3) Resize vm to a new flavor with larger swap disk size. The swap disk will be resized to the larger size.
4) Resize-revert to original flavor.
5) Check actual disk sizes from within the VM and from ceph directly.

Expected behaviour: VM swap disk size should be reverted back to the
original size.

Actual behaviour: VM swap disk remains at the larger size.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: ceph compute resize

https://bugs.launchpad.net/bugs/1781643
[Yahoo-eng-team] [Bug 1538565] Re: Guest CPU does not support 1Gb hugepages with explicit models
The code at https://review.openstack.org/#/c/534384/ has been merged,
and should allow the operator to explicitly add the pdpe1gb flag.
Marking as fixed.

** Changed in: nova
       Status: Confirmed => Fix Released

https://bugs.launchpad.net/bugs/1538565

Title:
  Guest CPU does not support 1Gb hugepages with explicit models

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  The CPU flag pdpe1gb indicates that the CPU model supports 1 GB
  hugepages - without it, the Linux operating system refuses to
  allocate 1 GB huge pages (and other things might go wrong if it did).
  Not all Intel CPU models support 1 GB huge pages, so the qemu options
  -cpu Haswell and -cpu Broadwell give you a vCPU that does not have
  the pdpe1gb flag. This is the correct thing to do, since the VM might
  be running on a Haswell that does not have 1GB huge pages.

  The problem is that Nova flavor extra specs with the libvirt driver
  for qemu/kvm only allow you to define the CPU model, either an
  explicit model or "host". The host option means that all CPU flags in
  the host CPU are passed to the vCPU. However, the host option
  complicates VM migration since the CPU would change after migration.

  In conclusion, there is no good way to specify a CPU model that would
  imply the pdpe1gb flag. Huge pages are used eg with dpdk. They
  improve the performance of the VM mainly by reducing tlb size.
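With that change in place, a compute node can presumably be configured
along these lines to get a named model plus the missing flag (option
names per the merged review; exact availability depends on the nova
release):

    [libvirt]
    cpu_mode = custom
    cpu_model = Haswell
    cpu_model_extra_flags = pdpe1gb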
[Yahoo-eng-team] [Bug 1764556] Re: "nova list" fails with exception.ServiceNotFound if service is deleted and has no UUID
I think we could get into the bad state described in the bug if we do a
slightly different series of actions:

1) boot instance on Ocata
2) migrate instance
3) delete compute node (thus deleting the service record)
4) create compute node with same name
5) migrate instance to newly-created compute node
6) upgrade to Pike

This should result in the deleted service not having a UUID, which will
cause problems in Pike if we do a "nova list". I suppose an argument
could be made that this is an unlikely scenario, which is probably
true. :)

** Changed in: nova
       Status: Fix Released => New

https://bugs.launchpad.net/bugs/1764556

Title:
  "nova list" fails with exception.ServiceNotFound if service is
  deleted and has no UUID

Status in OpenStack Compute (nova):
  New
[Yahoo-eng-team] [Bug 1764556] [NEW] "nova list" fails with exception.ServiceNotFound if service is deleted and has no UUID
Public bug reported:

We had a testcase where we booted an instance on Newton, migrated it
off the compute node, deleted the compute node (and service), upgraded
to Pike, created a new compute node with the same name, and migrated
the instance back to the compute node. At this point the "nova list"
command failed with exception.ServiceNotFound.

It appears that since the Service has no UUID the _from_db_object()
routine will try to add it, but the service.save() call fails because
the service in question has been deleted.

I reproduced the issue with stable/pike devstack. I booted an instance,
then created a fake entry in the "services" table without a UUID so the
table looked like this:

mysql> select * from services;
| created_at | updated_at | deleted_at | id | host | binary | topic | report_count | disabled | deleted | disabled_reason | last_seen_up | forced_down | version | uuid |
| 2018-02-20 16:10:07 | 2018-04-16 22:10:46 | NULL | 1 | devstack | nova-conductor | conductor | 477364 | 0 | 0 | NULL | 2018-04-16 22:10:46 | 0 | 22 | c041d7cf-5047-4014-b50c-3ba6b5d95097 |
| 2018-02-20 16:10:10 | 2018-04-16 22:10:54 | NULL | 2 | devstack | nova-compute | compute | 477149 | 0 | 0 | NULL | 2018-04-16 22:10:54 | 0 | 22 | d0cfb63c-8b59-4b65-bb7e-6b89acd3fe35 |
| 2018-02-20 16:10:10 | 2018-04-16 20:29:33 | 2018-04-16 20:30:33 | 3 | devstack | nova-compute | compute | 476432 | 0 | 3 | NULL | 2018-04-16 20:30:33 | 0 | 22 | NULL |

At this point, running "nova show <instance>" worked fine, but running
"nova list" failed:

stack@devstack:~/devstack$ nova list
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. (HTTP 500) (Request-ID: req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6)

The nova-api log looked like this:

Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG nova.compute.api [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Listing 1000 instances in cell 09eb515f-9906-40bf-9be6-63b5e6ee279a(cell1) {{(pid=4261) _get_instances_by_filters_all_cells /opt/stack/nova/nova/compute/api.py:2559}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s {{(pid=4261) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s {{(pid=4261) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG nova.objects.service [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Generated UUID 4368a7ff-f589-4197-b0b9-d2afdb71ca33 for service 3 {{(pid=4261) _from_db_object /opt/stack/nova/nova/objects/service.py:245}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR nova.api.openstack.extensions [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Unexpected exception in API method: ServiceNotFound: Service 3 could not be found.
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR nova.api.openstack.extensions Traceback (most recent call last):
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR nova.api.openstack.extensions   File "/opt/stack/nova/nova/api/openstack/extensions.py", line 336, in wrapped
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR nova.api.openstack.extensions     return f(*args, **kwargs)
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR nova.api.openstack.extensions   File
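A sketch of the kind of guard that would avoid this, assuming the fix
is simply to not write a generated UUID back for soft-deleted rows (the
helper name is illustrative; the field names match the services table
above):

    # in Service._from_db_object(), per the log above
    if db_service['uuid'] is None and not db_service['deleted']:
        # Only persist a freshly generated UUID for live services;
        # save() on a deleted service raises ServiceNotFound.
        service._ensure_uuid_and_save()  # illustrative helper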
[Yahoo-eng-team] [Bug 1763766] [NEW] nova needs to disallow topology changes on image rebuild
Public bug reported:

When doing a rebuild the assumption throughout the code is that we are
not changing the resources consumed by the guest (that is what a resize
is for). The complication here is that there are a number of image
properties which might affect the instance resource consumption (in
conjunction with a suitable flavor):

hw_numa_nodes=X
hw_numa_cpus.X=Y
hw_numa_mem.X=Y
hw_mem_page_size=X
hw_cpu_thread_policy=X
hw_cpu_policy=X

Due to the assumptions made in the rest of the code, we need to add a
check to ensure that on a rebuild the above image properties do not
differ between the old and new images (see the sketch below).

While they might look suspicious, I think that the following image
properties *should* be allowed to differ, since they only affect the
topology seen by the guest:

hw_cpu_threads
hw_cpu_cores
hw_cpu_sockets
hw_cpu_max_threads
hw_cpu_max_cores
hw_cpu_max_sockets
hw_cpu_realtime_mask

** Affects: nova
     Importance: Medium
         Status: Triaged

** Affects: nova/ocata
     Importance: Medium
         Status: Confirmed

** Affects: nova/pike
     Importance: Medium
         Status: Confirmed

** Affects: nova/queens
     Importance: Medium
         Status: Confirmed

** Tags: compute rebuild

https://bugs.launchpad.net/bugs/1763766
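A minimal sketch of that check (where it would hook in, presumably the
rebuild path in the API, is an assumption):

    # Image properties that can change how much resource the guest
    # consumes, and so must not change across a rebuild.
    _RESOURCE_PROPS = ('hw_numa_nodes', 'hw_mem_page_size',
                       'hw_cpu_thread_policy', 'hw_cpu_policy')
    # hw_numa_cpus.X / hw_numa_mem.X are per-node keys, so match them
    # by prefix.
    _RESOURCE_PREFIXES = ('hw_numa_cpus.', 'hw_numa_mem.')

    def check_rebuild_image_props(old_props, new_props):
        """Raise if the new image would alter resource consumption."""
        keys = set(_RESOURCE_PROPS)
        for key in set(old_props) | set(new_props):
            if key.startswith(_RESOURCE_PREFIXES):
                keys.add(key)
        for key in keys:
            if old_props.get(key) != new_props.get(key):
                raise ValueError('rebuild would change %s' % key)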
[Yahoo-eng-team] [Bug 1552777] Re: resizing from flavor with swap to one without swap puts instance into Error status
** Changed in: nova
       Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1552777

Title:
  resizing from flavor with swap to one without swap puts instance into
  Error status

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In a single-node devstack (current trunk, nova commit 6e1051b7), if
  you boot an instance with a flavor that has nonzero swap and then
  resize to a flavor with zero swap it causes an exception. It seems
  that we somehow neglect to remove the swap file from the instance.

  2016-03-03 10:02:41.415 ERROR nova.virt.libvirt.guest [req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] Error launching a defined domain with XML: (the domain XML markup was stripped in the mail archive; the surviving fragments identify instance-0001 / 54711b56-fa72-4eac-a5d3-aa29ed128098, display name "asdf", owner admin/alt_demo, 524288 KiB memory, 1 vCPU, kernel/ramdisk under /opt/stack/data/nova/instances/54711b56-fa72-4eac-a5d3-aa29ed128098/, cmdline "root=/dev/vda console=tty0 console=ttyS0", SMBIOS strings "OpenStack Foundation" / "OpenStack Nova" 13.0.0, and the emulator /usr/bin/kvm-spice)

  2016-03-03 10:02:41.417 ERROR nova.compute.manager [req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] Setting instance vm_state to ERROR
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] Traceback (most recent call last):
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/compute/manager.py", line 3999, in finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     disk_info, image_meta)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/compute/manager.py", line 3964, in _finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     old_instance_type)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     six.reraise(self.type_, self.value, self.tb)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/compute/manager.py", line 3959, in _finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     block_device_info, power_on)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7202, in finish_migration
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     vifs_already_plugged=True)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4862, in _create_domain_and_network
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     xml, pause=pause, power_on=power_on)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4793, in _create_domain
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]     guest.launch(pause=pause)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098]
[Yahoo-eng-team] [Bug 1756179] [NEW] deleting a nova-compute service leaves orphaned records in placement and host mapping
Public bug reported:

Currently when deleting a nova-compute service via the API, we will
delete the service and compute_node records in the DB, but the
placement resource provider and host mapping records will be orphaned.

The orphaned resource provider records have been found to cause
scheduler failures if you re-create the compute node with the same name
(but a different UUID). It has been theorized that the stale host
mapping records could end up pointing at the wrong cell.

In discussions on IRC
(http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-15.log.html#t2018-03-15T19:30:13)
it was proposed that we should

1. delete the RP in placement
2. delete the host mapping
3. delete the service/node

(a sketch of that ordering follows below). Optionally we could delete
the compute node prior to deleting the service to make it explicit and
because the ordering is slightly more logical, but this is not a
requirement since it will be done implicitly as part of deleting the
service.

** Affects: nova
     Importance: Medium
         Status: Triaged

** Affects: nova/pike
     Importance: Undecided
         Status: New

** Affects: nova/queens
     Importance: Undecided
         Status: New

** Tags: api cells placement

** Summary changed:

- deleting a nova-compute service leaves orphaned records in placement
+ deleting a nova-compute service leaves orphaned records in placement and host mapping

https://bugs.launchpad.net/bugs/1756179
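That ordering, roughly (a sketch, not the merged code; error handling
and the API plumbing are omitted):

    from nova import objects
    from nova.scheduler.client import report

    def delete_compute_service(context, service):
        reportclient = report.SchedulerReportClient()
        # 1. Delete the resource provider(s) first, so placement stops
        #    offering inventory for the dead node.
        for node in objects.ComputeNodeList.get_all_by_host(
                context, service.host):
            reportclient.delete_resource_provider(context, node)
        # 2. Delete the host mapping so the host can't later resolve
        #    to a stale (or wrong) cell.
        objects.HostMapping.get_by_host(context, service.host).destroy()
        # 3. Finally delete the service, which implicitly removes the
        #    compute_node records.
        service.destroy()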
[Yahoo-eng-team] [Bug 1755981] [NEW] powering off and on an instance can result in instance boot failure due to serial port handling race
Public bug reported:

The following is specific to the libvirt driver.

When we call power_off() it calls _destroy(), which in turn calls
self._get_serial_ports_from_guest() and loops over all the serial ports
calling serial_console.release_port() on each. This removes the host
TCP port from ALLOCATED_PORTS (which is the set of allocated ports on
the host).

Then when we call power_on(), it again calls _destroy(), which again
calls self._get_serial_ports_from_guest(). This will return the same
set of ports that it did before. This is a problem, because those ports
could have been allocated to another instance in the meantime!

So in the case where one or more of those ports had been allocated to
another instance, we call serial_console.release_port() on them, and
remove them from ALLOCATED_PORTS. Then as part of power_on() we will
create new XML with new serial ports, which could select the ports that
we just removed from ALLOCATED_PORTS (which are actually in use by
another instance). When qemu tries to bind to this port it will fail,
causing the instance to error out and stay in the SHUTOFF state.

One possible solution would be to call guest.detach_device() on the
"serial" and "console" devices from the guest in the power_off()
routine. That way when we call _destroy() in the power_on() routine
there wouldn't be any devices returned by
_get_serial_ports_from_guest(). This is a bit messy though, so if
anyone has any better ideas I'd like to hear about it.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute libvirt

https://bugs.launchpad.net/bugs/1755981
[Yahoo-eng-team] [Bug 1754782] [NEW] we skip critical scheduler filters when forcing the host on instance boot
Public bug reported:

When booting an instance it's possible to force it to be placed on a
specific host using the "--availability-zone nova:host" syntax. If you
do this, the code at
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L581
will return early rather than call
self.filter_handler.get_filtered_objects()

Based on discussions at the PTG with Dan Smith, the simplest solution
would be to create a flag similar to RUN_ON_REBUILD which would be
applied to the various scheduler filters in a manner analogous to how
rebuild is handled now. Presumably we'd want to call something like
this during the instance boot code to ensure we hit the existing "if
not check_type" at L581:

    request_spec.scheduler_hints['_nova_check_type'] = ['build']

Then in the various critical filters (NUMATopologyFilter for example,
and PciPassthroughFilter, and maybe some others like ComputeFilter) we
could define something like "RUN_ON_BUILD = True" to ensure that they
run even when forcing a host.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute scheduler

https://bugs.launchpad.net/bugs/1754782
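Roughly what the filter side of that would look like (a sketch
mirroring the existing RUN_ON_REBUILD machinery; none of this is merged
code):

    from nova import filters

    class BaseHostFilter(filters.BaseFilter):
        # Existing knob: whether the filter runs for rebuild requests.
        RUN_ON_REBUILD = False
        # Proposed knob: run even when the boot request forced a host.
        RUN_ON_BUILD = False

    class NUMATopologyFilter(BaseHostFilter):
        # Must run even for forced-host boots, or we can place a guest
        # with a NUMA topology on a host that cannot actually fit it.
        RUN_ON_REBUILD = True
        RUN_ON_BUILD = True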
[Yahoo-eng-team] [Bug 1538565] Re: Guest CPU does not support 1Gb hugepages with explicit models
In recent versions of qemu the "Skylake-Server" cpu model has the flag, but any earlier Intel processor models do not. ** Changed in: nova Status: Expired => Confirmed -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1538565 Title: Guest CPU does not support 1Gb hugepages with explicit models Status in OpenStack Compute (nova): Confirmed Bug description: The CPU flag pdpe1gb indicates that the CPU model supports 1 GB hugepages - without it, the Linux operating system refuses to allocate 1 GB huge pages (and other things might go wrong if it did). Not all Intel CPU models support 1 GB huge pages, so the qemu options -cpu Haswell and -cpu Broadwell give you a vCPU that does not have the pdpe1gb flag. This is the correct thing to do, since the VM might be running on a Haswell that does not have 1GB huge pages. Problem is that Nova flavor extra specs with the libvirt driver for qemu/kvm only allow to define the CPU model, either an explicit model or "host". The host option means that all CPU flags in the host CPU are passed to the vCPU. However, the host option complicates VM migration since the CPU would change after migration. In conclusion, there is no good way to specify a CPU model that would imply the pdpe1gb flag. Huge pages are used eg with dpdk. They improve the performance of the VM mainly by reducing tlb size. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1538565/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1750623] [NEW] rebuild to same host with different image shouldn't check with placement
Public bug reported: When doing a rebuild-to-same-host but with a different image, all we really want to do is ensure that the image properties for the new image are still valid for the current host. Accordingly we need to go through the scheduler (to run the image-related filters) but we don't want to do anything related to resource consumption. Currently the scheduler will contact placement to get a pre-filtered list of compute nodes with sufficient free resources for the instance in question. If the instance is on a compute node that is close to full, this may result in the current compute node being filtered out of the list, which will result in a NoValidHost exception. Ideally, in the case where we are doing a rebuild-to-same-host we would simply retrieve the information for the current compute node from the DB instead of from placement, and then run the image-related scheduler filters. ** Affects: nova Importance: Undecided Status: New ** Tags: rebuild scheduler -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1750623 Title: rebuild to same host with different image shouldn't check with placement Status in OpenStack Compute (nova): New Bug description: When doing a rebuild-to-same-host but with a different image, all we really want to do is ensure that the image properties for the new image are still valid for the current host. Accordingly we need to go through the scheduler (to run the image-related filters) but we don't want to do anything related to resource consumption. Currently the scheduler will contact placement to get a pre-filtered list of compute nodes with sufficient free resources for the instance in question. If the instance is on a compute node that is close to full, this may result in the current compute node being filtered out of the list, which will result in a NoValidHost exception. Ideally, in the case where we are doing a rebuild-to-same-host we would simply retrieve the information for the current compute node from the DB instead of from placement, and then run the image-related scheduler filters. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1750623/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
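A minimal sketch of the suggested flow (every helper name here is hypothetical, for illustration only):

    def select_host_for_rebuild(context, request_spec, current_host,
                                host_manager, filter_handler, image_filters):
        # Skip placement entirely: load the instance's current host
        # from the DB and re-run only the image-related filters on it.
        host_state = host_manager.get_host_state(context, current_host)
        filtered = filter_handler.get_filtered_objects(
            image_filters, [host_state], request_spec)
        return filtered[0] if filtered else None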
[Yahoo-eng-team] [Bug 1750618] [NEW] rebuild to same host with a different image results in erroneously doing a Claim
Public bug reported: As of stable/pike if we do a rebuild-to-same-node with a new image, it results in ComputeManager.rebuild_instance() being called with "scheduled_node=" and "recreate=False". This results in a new Claim, which seems wrong since we're not changing the flavor and that claim could fail if the compute node is already full. The comments in ComputeManager.rebuild_instance() make it appear that it expects both "recreate" and "scheduled_node" to be None for the rebuild-to-same-host case, otherwise it will do a Claim. However, if we rebuild to a different image it ends up going through the scheduler which means that "scheduled_node" is not None. ** Affects: nova Importance: Undecided Status: New ** Tags: compute rebuild -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1750618 Title: rebuild to same host with a different image results in erroneously doing a Claim Status in OpenStack Compute (nova): New Bug description: As of stable/pike if we do a rebuild-to-same-node with a new image, it results in ComputeManager.rebuild_instance() being called with "scheduled_node=" and "recreate=False". This results in a new Claim, which seems wrong since we're not changing the flavor and that claim could fail if the compute node is already full. The comments in ComputeManager.rebuild_instance() make it appear that it expects both "recreate" and "scheduled_node" to be None for the rebuild-to-same-host case, otherwise it will do a Claim. However, if we rebuild to a different image it ends up going through the scheduler which means that "scheduled_node" is not None. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1750618/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
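Schematically, the decision described above looks like this (a simplified sketch patterned on nova's resource tracker, not verbatim nova code):

    def claim_for_rebuild(rt, claims, context, instance, scheduled_node,
                          recreate):
        # Simplified from ComputeManager.rebuild_instance():
        if recreate or scheduled_node is not None:
            # Rebuild-to-same-host with a new image lands here because
            # the scheduler filled in scheduled_node, so a Claim is
            # made even though the flavor did not change.
            return rt.rebuild_claim(context, instance, scheduled_node)
        # No resource claim for the plain same-host rebuild path.
        return claims.NopClaim()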
[Yahoo-eng-team] [Bug 1605098] Re: Nova usage not showing server real uptime
Nova reserves resources for the instance even if it's not running, so the reported uptime probably shouldn't be used for billing. Also, the uptime gets reset on a resize/revert-resize/rescue, further making it tricky to use for billing. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1605098 Title: Nova usage not showing server real uptime Status in OpenStack Compute (nova): Invalid Bug description: Hi All, I am trying to calculate OpenStack server "uptime", but nova usage only gives the server creation time, which can't be used for billing. Is there any way to do this? To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1605098/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1734394] [NEW] nova microversion 2.36 accidentally removed support for "force" when setting quotas
Public bug reported: It is supposed to be possible to specify the "force" option when updating a quota-set. Up to microversion 2.35 this works as expected. However, in 2.36 it no longer works, and nova-api sends back: RESP BODY: {"badRequest": {"message": "Invalid input for field/attribute quota_set. Value: {u'cores': 95, u'force': True}. Additional properties are not allowed (u'force' was unexpected)", "code": 400}} The problem seems to be that in schemas/quota_sets.py the "force" parameter is not in quota_resources, but rather is added to "update_quota_set". When creating update_quota_set_v236 they copied quota_resources instead of copying update_quota_set, and this meant that they lost the support for the "force" parameter. ** Affects: nova Importance: Undecided Status: New ** Tags: api ocata-backport-potential pike-backport-potential quotas ** Tags added: ocata-backport-potential pike-backport-potential quotas -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1734394 Title: nova microversion 2.36 accidentally removed support for "force" when setting quotas Status in OpenStack Compute (nova): New Bug description: It is supposed to be possible to specify the "force" option when updating a quota-set. Up to microversion 2.35 this works as expected. However, in 2.36 it no longer works, and nova-api sends back: RESP BODY: {"badRequest": {"message": "Invalid input for field/attribute quota_set. Value: {u'cores': 95, u'force': True}. Additional properties are not allowed (u'force' was unexpected)", "code": 400}} The problem seems to be that in schemas/quota_sets.py the "force" parameter is not in quota_resources, but rather is added to "update_quota_set". When creating update_quota_set_v236 they copied quota_resources instead of copying update_quota_set, and this meant that they lost the support for the "force" parameter. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1734394/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
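The shape of the problem can be sketched like this (simplified; the real schemas in nova's schemas/quota_sets.py are far more detailed):

    import copy

    quota_resources = {
        'cores': {},  # per-resource validation schemas elided
        'ram': {},
    }

    # 'force' is defined on update_quota_set, not in quota_resources:
    update_quota_set = copy.deepcopy(quota_resources)
    update_quota_set['force'] = {'type': 'boolean'}

    # The 2.36 schema was derived from quota_resources instead of
    # update_quota_set, so 'force' was silently dropped:
    update_quota_set_v236 = copy.deepcopy(quota_resources)
    assert 'force' not in update_quota_set_v236  # the reported bug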
[Yahoo-eng-team] [Bug 1724686] [NEW] authentication code hangs when there are three or more admin keystone endpoints
Public bug reported: I'm running stable/pike devstack, and I was playing around with what happens when there are many endpoints in multiple regions, and I stumbled over a scenario where the keystone authentication code hangs. My original endpoint list looked like this:

ubuntu@devstack:/opt/stack/devstack$ openstack endpoint list
+----------------------------------+-----------+--------------+-----------------+---------+-----------+--------------------------------------------------+
| ID                               | Region    | Service Name | Service Type    | Enabled | Interface | URL                                              |
+----------------------------------+-----------+--------------+-----------------+---------+-----------+--------------------------------------------------+
| 0a9979ebfdbf48ce91ccf4e2dd952c1a | RegionOne | kingbird     | synchronization | True    | internal  | http://127.0.0.1:8118/v1.0                       |
| 11d5507afe2a4eddb4f030695699114f | RegionOne | placement    | placement       | True    | public    | http://128.224.186.226/placement                 |
| 1e42cf139398405188755b7e00aecb4d | RegionOne | keystone     | identity        | True    | admin     | http://128.224.186.226/identity                  |
| 2daf99edecae4afba88bb58233595481 | RegionOne | glance       | image           | True    | public    | http://128.224.186.226/image                     |
| 2ece52e8bbb34d47b9bd5611f5959385 | RegionOne | kingbird     | synchronization | True    | admin     | http://127.0.0.1:8118/v1.0                       |
| 4835a089666a4b03bd2f499457ade6c2 | RegionOne | kingbird     | synchronization | True    | public    | http://127.0.0.1:8118/v1.0                       |
| 78e9fbc0a47642268eda3e3576920f37 | RegionOne | nova         | compute         | True    | public    | http://128.224.186.226/compute/v2.1              |
| 96a1e503dc0e4520a190b01f6a0cf79c | RegionOne | keystone     | identity        | True    | public    | http://128.224.186.226/identity                  |
| a1887dbc8c5e4af5b4a6dc5ce224b8ff | RegionOne | cinderv2     | volumev2        | True    | public    | http://128.224.186.226/volume/v2/$(project_id)s  |
| b7d5938141694a4c87adaed5105ea3ab | RegionOne | cinder       | volume          | True    | public    | http://128.224.186.226/volume/v1/$(project_id)s  |
| bb169382cbea4715964e4652acd48070 | RegionOne | nova_legacy  | compute_legacy  | True    | public    | http://128.224.186.226/compute/v2/$(project_id)s |
| e01c8d8e08874d61b9411045a99d4860 | RegionOne | neutron      | network         | True    | public    | http://128.224.186.226:9696/                     |
| f94c96ed474249a29a6c0a1bb2b2e500 | RegionOne | cinderv3     | volumev3        | True    | public    | http://128.224.186.226/volume/v3/$(project_id)s  |
+----------------------------------+-----------+--------------+-----------------+---------+-----------+--------------------------------------------------+

I was able to successfully run the following python code:

    from keystoneauth1 import loading
    from keystoneauth1 import session
    from keystoneclient.v3 import client

    loader = loading.get_plugin_loader("password")
    auth = loader.load_from_options(username='admin', password='secret',
                                    project_name='admin',
                                    auth_url='http://128.224.186.226/identity')
    sess = session.Session(auth=auth)
    keystone = client.Client(session=sess)
    keystone.services.list()

I then duplicated all of the endpoints in a new region "region2", and was able to run the python code. When I duplicated all the endpoints again in a new region "region3" (for a total of 39 endpoints) the python code hung at the final line. Removing all the "region3" endpoints allowed the python code to work again. During all of this the command "openstack endpoint list" worked fine. Further testing seems to indicate that it is the third "admin" keystone endpoint that is causing the problem. I can add multiple "public" keystone endpoints, but three or more "admin" keystone endpoints cause the python code to hang.
** Affects: keystone Importance: Undecided Status: New

** Summary changed:

- authentication code hangs when there are many endpoints
+ authentication code hangs when there are three or more admin keystone endpoints

** Description changed:

  I'm running stable/pike devstack, and I was playing around with what happens when there are many endpoints in multiple regions, and I stumbled over a scenario where the keystone authentication code hangs. My original endpoint list looked like this:

  ubuntu@devstack:/opt/stack/devstack$ openstack endpoint list
  | ID | Region | Service Name | Service Type | Enabled | Interface | URL |
[Yahoo-eng-team] [Bug 1284719] Re: buggy live migration rollback when using shared storage
** Changed in: nova Status: Expired => Incomplete -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1284719 Title: buggy live migration rollback when using shared storage Status in OpenStack Compute (nova): Incomplete Bug description: I'm running the current Icehouse code in devstack. I was looking at the code and noticed something suspicious. It looks like if we try to migrate a shared-storage instance and fail and end up rolling back we could end up with messed-up networking on the destination host. When setting up a live migration we unconditionally run ComputeManager.pre_live_migration() on the destination host to do various things including setting up networks on the host. If something goes wrong with the live migration in ComputeManager._rollback_live_migration() we will only call self.compute_rpcapi.rollback_live_migration_at_destination() if we're doing block migration or volume-backed migration that isn't shared storage. However, looking at ComputeManager.rollback_live_migration_at_destination(), I also see it cleaning up networking as well as block device. If we never call that cleanup code, then the networking stuff that was done in pre_live_migration() won't get rolled back. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1284719/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
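Schematically (the condition is simplified from the description above; not verbatim nova code):

    def needs_destination_rollback(block_migration, is_volume_backed,
                                   is_shared_storage):
        # _rollback_live_migration() only triggers
        # rollback_live_migration_at_destination() for these cases...
        return block_migration or (is_volume_backed and
                                   not is_shared_storage)

    # ...so for a plain shared-storage migration the networking set up
    # by pre_live_migration() on the destination is never rolled back.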
[Yahoo-eng-team] [Bug 1712684] [NEW] allocations not immediately removed when instance is deleted
Public bug reported: Based on code inspection and a discussion with mriedem on IRC, it appears that when deleting an instance in a pure-Pike cloud the allocations are not removed until the update_available_resource() periodic task calls ResourceTracker._update_usage_from_instances(), which calls _remove_deleted_instances_allocations(). In a mixed Ocata/Pike cloud the allocation will be freed up immediately when _update_usage_from_instance() calls self.reportclient.update_instance_allocation(). In the ServerMovingTests functional test we bypass this by forcing the periodic task to run before checking that the allocations have been removed. ** Affects: nova Importance: High Status: Triaged ** Tags: compute pike-rc-potential placement scheduler ** Summary changed: - allocations not immediately removed when instance deleted + allocations not immediately removed when instance is deleted -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1712684 Title: allocations not immediately removed when instance is deleted Status in OpenStack Compute (nova): Triaged Bug description: Based on code inspection and a discussion with mriedem on IRC, it appears that when deleting an instance in a pure-Pike cloud the allocations are not removed until the update_available_resource() periodic task calls ResourceTracker._update_usage_from_instances(), which calls _remove_deleted_instances_allocations(). In a mixed Ocata/Pike cloud the allocation will be freed up immediately when _update_usage_from_instance() calls self.reportclient.update_instance_allocation(). In the ServerMovingTests functional test we bypass this by forcing the periodic task to run before checking that the allocations have been removed. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1712684/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1695991] [NEW] "nova-manage db online_data_migrations" doesn't report matched/migrated properly
Public bug reported: When running "nova-manage db online_data_migrations", it will report how many items matched the query and how many of the matching items were migrated. However, most of the migration routines are not properly reporting the "total matched" count when "max_count" is specified. This makes it difficult to know whether you have to call it again or not when specifying "--max-count" explicitly. Take for example Flavor.migrate_flavors(). This limits the value of main_db_ids to a max of "count":

    main_db_ids = _get_main_db_flavor_ids(ctxt, count)
    count_all = len(main_db_ids)
    return count_all, count_hit

If someone sees that there were 50 items total and 50 items were converted, they may think that all the work is done. It would be better to call _get_main_db_flavor_ids() with no limit to the number of matches, and apply the limit to the number of conversions. Alternately, we should document that if --max-count is used then "nova-manage db online_data_migrations" should be called multiple times until *no* matches are reported and we can basically ignore the number of hits. (Or until no hits are reported, which would more closely align with the code in the case that max-count isn't specified explicitly.)

** Affects: nova Importance: Undecided Status: New

** Description changed:

  When running "nova-manage db online_data_migrations", it will report how many items matched the query and how many of the matching items were migrated. However, most of the migration routines are not properly reporting the "total matched" count when "max_count" is specified.

  This makes it difficult to know whether you have to call it again or not when specifying "--max-count" explicitly.

  Take for example Flavor.migrate_flavors(). This limits the value of main_db_ids to a max of "count":

- main_db_ids = _get_main_db_flavor_ids(ctxt, count)
+     main_db_ids = _get_main_db_flavor_ids(ctxt, count)
      count_all = len(main_db_ids)
      return count_all, count_hit
-
- If someone sees that there were 50 items total and 50 items were converted, they may think that all the work is done. It would be better to call _get_main_db_flavor_ids() with no limit to the number of matches, and apply the limit to the number of conversions.
+ If someone sees that there were 50 items total and 50 items were converted, they may think that all the work is done. It would be better to call _get_main_db_flavor_ids() with no limit to the number of matches, and apply the limit to the number of conversions.

  Alternately, we should document that if --max-count is used then "nova-manage db online_data_migrations" should be called multiple times until *no* matches are reported and we can basically ignore the number of
- hits.
+ hits. (Or until no hits are reported, which would more closely align with the code in the case that max-count isn't specified explicitly.)

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1695991 Title: "nova-manage db online_data_migrations" doesn't report matched/migrated properly Status in OpenStack Compute (nova): New Bug description: When running "nova-manage db online_data_migrations", it will report how many items matched the query and how many of the matching items were migrated. However, most of the migration routines are not properly reporting the "total matched" count when "max_count" is specified.
This makes it difficult to know whether you have to call it again or not when specifying "--max-count" explicitly. Take for example Flavor.migrate_flavors(). This limits the value of main_db_ids to a max of "count":

    main_db_ids = _get_main_db_flavor_ids(ctxt, count)
    count_all = len(main_db_ids)
    return count_all, count_hit

If someone sees that there were 50 items total and 50 items were converted, they may think that all the work is done. It would be better to call _get_main_db_flavor_ids() with no limit to the number of matches, and apply the limit to the number of conversions. Alternately, we should document that if --max-count is used then "nova-manage db online_data_migrations" should be called multiple times until *no* matches are reported and we can basically ignore the number of hits. (Or until no hits are reported, which would more closely align with the code in the case that max-count isn't specified explicitly.) To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1695991/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
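A minimal sketch of the first suggestion (illustrative only; it assumes the ID lookup can be called without a limit, and the per-row helper is hypothetical):

    def migrate_flavors(ctxt, count):
        # Count every remaining match, then cap only the conversions.
        main_db_ids = _get_main_db_flavor_ids(ctxt)  # no limit here
        count_all = len(main_db_ids)
        count_hit = 0
        for flavor_id in main_db_ids[:count]:
            _migrate_one_flavor(ctxt, flavor_id)  # hypothetical helper
            count_hit += 1
        return count_all, count_hit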
[Yahoo-eng-team] [Bug 1695965] [NEW] "nova-manage db online_data_migrations" exit code is strange
Public bug reported: If I'm reading the code right, the exit value for "nova-manage db online_data_migrations" will be 1 if we actually performed some migrations and 0 if we performed no migrations, either because there were no remaining migrations or because the migration code raised an exception. This seems less than useful for someone attempting to script repeated calls to this with --max-count set. The caller needs to parse the output to determine whether or not it was successful. I think it would make more sense to have the exit code as follows:

0 -- no errors and completed
1 -- one of the migrations raised an exception, needs manual action
3 -- no errors but not yet complete, need to call again

since it would allow for an automated retry based solely on the exit code. At the very least, the exit code should be nonzero for the case where one of the migrations raised an exception, and 0 for the case where no exception was raised. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1695965 Title: "nova-manage db online_data_migrations" exit code is strange Status in OpenStack Compute (nova): New Bug description: If I'm reading the code right, the exit value for "nova-manage db online_data_migrations" will be 1 if we actually performed some migrations and 0 if we performed no migrations, either because there were no remaining migrations or because the migration code raised an exception. This seems less than useful for someone attempting to script repeated calls to this with --max-count set. The caller needs to parse the output to determine whether or not it was successful. I think it would make more sense to have the exit code as follows:

0 -- no errors and completed
1 -- one of the migrations raised an exception, needs manual action
3 -- no errors but not yet complete, need to call again

since it would allow for an automated retry based solely on the exit code. At the very least, the exit code should be nonzero for the case where one of the migrations raised an exception, and 0 for the case where no exception was raised. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1695965/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
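A sketch of the proposed exit-code mapping (illustrative):

    import sys

    def finish(migration_error, remaining):
        if migration_error:
            sys.exit(1)  # a migration raised; needs manual action
        if remaining:
            sys.exit(3)  # no errors, but rows remain; call again
        sys.exit(0)      # no errors and nothing left to migrate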
[Yahoo-eng-team] [Bug 1691780] [NEW] port id is incorrectly logged in _update_port_binding_for_instance
Public bug reported: At line 2484 of https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py the code is accessing p['id'] in the LOG.info block, but that means it logs the last entry that it iterated over in the previous loop over the ports rather than the port_id being processed in the current loop. We see this when we have multiple ports: the log suggests the same port is being updated over and over, when it is actually working properly:

2017-05-10 16:39:32.936 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}
2017-05-10 16:39:33.905 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}
2017-05-10 16:39:35.084 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}

** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: In Progress ** Tags: neutron -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1691780 Title: port id is incorrectly logged in _update_port_binding_for_instance Status in OpenStack Compute (nova): In Progress Bug description: At line 2484 of https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py the code is accessing p['id'] in the LOG.info block, but that means it logs the last entry that it iterated over in the previous loop over the ports rather than the port_id being processed in the current loop.
We see this when we have multiple ports: the log suggests the same port is being updated over and over, when it is actually working properly:

2017-05-10 16:39:32.936 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}
2017-05-10 16:39:33.905 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}
2017-05-10 16:39:35.084 72563 INFO nova.network.neutronv2.api [req-56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e 7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1691780/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
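The bug reduces to a stale loop variable; a minimal standalone reproduction (the structure is illustrative, not the actual nova code):

    ports = [{'id': 'port-a'}, {'id': 'port-b'}, {'id': 'port-c'}]

    for p in ports:
        pass  # first loop; 'p' keeps its final value afterwards

    for port_id in ['port-a', 'port-b', 'port-c']:
        # BUG: logs the leftover p['id'] ('port-c' every time) instead
        # of the port actually being updated in this iteration:
        print('Updating port %s' % p['id'])
        # fix: print('Updating port %s' % port_id)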
[Yahoo-eng-team] [Bug 1690890] [NEW] error message not clear for shared live migration with block storage
Public bug reported: When using an older microversion (2.25 or earlier) with boot-from-image, and the user forgets to specify block-migration, the error message returned is this: "Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk." This has a couple of things wrong with it. First, the triple-negative is a bit confusing, especially for non-native English speakers. Second, it implies that you cannot do a block migration, which is obviously false. I think a clearer error message would be something like: "Shared storage migration requires either shared storage or boot-from-volume with no local disks." ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: In Progress ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1690890 Title: error message not clear for shared live migration with block storage Status in OpenStack Compute (nova): In Progress Bug description: When using an older microversion (2.25 or earlier) with boot-from-image, and the user forgets to specify block-migration, the error message returned is this: "Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk." This has a couple of things wrong with it. First, the triple-negative is a bit confusing, especially for non-native English speakers. Second, it implies that you cannot do a block migration, which is obviously false. I think a clearer error message would be something like: "Shared storage migration requires either shared storage or boot-from-volume with no local disks." To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1690890/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1688673] [NEW] cpu_realtime_mask handling is not intuitive
Public bug reported: The nova code implicitly assumes that all vCPUs are realtime in nova.virt.hardware.vcpus_realtime_topology(), and then it appends the user-specified mask. This only makes sense if the user-specified cpu_realtime_mask is an exclusion mask, but this isn't documented anywhere. It would make more sense to simply use the mask as passed-in from the end-user. In order to preserve the current behaviour we should probably special-case the scenario where the passed-in cpu_realtime_mask starts with a "^" (indicating an exclusion). ** Affects: nova Importance: Undecided Status: New ** Tags: compute

** Description changed:

  The nova code implicitly assumes that all vCPUs are realtime in
- nova.virt.hardware.vcpus_realtime_topology().
+ nova.virt.hardware.vcpus_realtime_topology(), and then it appends the
+ user-specified mask.

- This only makes sense if the cpu_realtime_mask is an exclusion mask, but
- this isn't documented anywhere.
+ This only makes sense if the user-specified cpu_realtime_mask is an
+ exclusion mask, but this isn't documented anywhere.

  It would make more sense to simply use the mask as passed-in from the end-user.

  In order to preserve the current behaviour we should probably special-case the scenario where the passed-in cpu_realtime_mask starts with a "^" (indicating an exclusion).

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1688673 Title: cpu_realtime_mask handling is not intuitive Status in OpenStack Compute (nova): New Bug description: The nova code implicitly assumes that all vCPUs are realtime in nova.virt.hardware.vcpus_realtime_topology(), and then it appends the user-specified mask. This only makes sense if the user-specified cpu_realtime_mask is an exclusion mask, but this isn't documented anywhere. It would make more sense to simply use the mask as passed-in from the end-user. In order to preserve the current behaviour we should probably special-case the scenario where the passed-in cpu_realtime_mask starts with a "^" (indicating an exclusion). To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1688673/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
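To illustrate the two interpretations (a simplified, hypothetical helper; real masks also allow ranges like "^0-1"):

    def realtime_vcpus(all_vcpus, mask):
        # Exclusion form: "^1" means "every vCPU is realtime except 1",
        # which is how the current code behaves.
        if mask.startswith('^'):
            excluded = {int(v) for v in mask[1:].split(',')}
            return set(all_vcpus) - excluded
        # Proposed inclusion form: "2,3" means "exactly vCPUs 2 and 3".
        return {int(v) for v in mask.split(',')}

    # e.g. realtime_vcpus({0, 1, 2, 3}, "^0") == {1, 2, 3}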
[Yahoo-eng-team] [Bug 1688599] [NEW] resource audit races against evacuating instance
Public bug reported: We recently hit an issue where an evacuating instance with a dedicated cpu_policy was pinned to the same host CPUs as other instances with a dedicated cpu_policy. During subsequent resource audits we would see cpu pinning errors. The root cause appears to be the fact that the resource audit skips the evacuating instance during the migration phase of the audit while the instance was rebuilding on the new host. It appears that _instance_in_resize_state() returned "false" because the vm_state was vm_states.ERROR. We allow rebuilding from the ERROR state though, so we should consider it. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1688599 Title: resource audit races against evacuating instance Status in OpenStack Compute (nova): New Bug description: We recently hit an issue where an evacuating instance with a dedicated cpu_policy was pinned to the same host CPUs as other instances with a dedicated cpu_policy. During subsequent resource audits we would see cpu pinning errors. The root cause appears to be the fact that the resource audit skips the evacuating instance during the migration phase of the audit while the instance was rebuilding on the new host. It appears that _instance_in_resize_state() returned "false" because the vm_state was vm_states.ERROR. We allow rebuilding from the ERROR state though, so we should consider it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1688599/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
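A sketch of the suggested fix (simplified; the real check involves more vm/task state combinations):

    from nova.compute import task_states, vm_states

    def _instance_in_resize_state(instance):
        if (instance.vm_state == vm_states.RESIZED
                and instance.task_state is None):
            return True
        # Treat ERROR like ACTIVE/STOPPED here: an evacuating instance
        # can be rebuilding from the ERROR state and must not be
        # skipped by the resource audit.
        return (instance.vm_state in (vm_states.ACTIVE, vm_states.STOPPED,
                                      vm_states.ERROR)
                and instance.task_state in (task_states.RESIZE_MIGRATING,
                                            task_states.REBUILDING))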
[Yahoo-eng-team] [Bug 1687067] [NEW] problems with cpu and cpu-thread policy where flavor/image specify different settings
Public bug reported: There are a number of issues related to CPU policy and CPU thread policy where the flavor extra-spec and image properties do not match up. The docs at https://docs.openstack.org/admin-guide/compute-cpu-topologies.html say the following: "Image metadata takes precedence over flavor extra specs. Thus, configuring competing policies causes an exception. By setting a shared policy through image metadata, administrators can prevent users configuring CPU policies in flavors and impacting resource utilization." For the CPU policy this is exactly backwards based on the code. The flavor is specified by the admin, and so it generally takes priority over the image which is specified by the end user. If the flavor specifies "dedicated" then the result is dedicated regardless of what the image specifies. If the flavor specifies "shared" then the result depends on the image--if it specifies "dedicated" then we will raise an exception, otherwise we use "shared". If the flavor doesn't specify a CPU policy then the image can specify whatever policy it wants. The issue around CPU threading policy is more complicated. Back in Mitaka, if the flavor specified a CPU threading policy of either None or "prefer" then we would use the threading policy specified by the image (if it was set). If the flavor specified a CPU threading policy of "isolate" or "require" and the image specified a different CPU threading policy then we raised exception.ImageCPUThreadPolicyForbidden(), otherwise we used the CPU threading policy specified by the flavor. This behaviour is described in the spec at https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html In git commit 24997343 (which went into Newton) Nikola Dipanov made a code change that doesn't match the intent in the git commit message:

  if flavor_thread_policy in [None, fields.CPUThreadAllocationPolicy.PREFER]:
-     cpu_thread_policy = image_thread_policy
+     cpu_thread_policy = flavor_thread_policy or image_thread_policy

The effect of this is that if the flavor specifies a CPU threading policy of "prefer" then we will use a policy of "prefer" regardless of the policy from the image. If the flavor specifies a CPU threading policy of None then we will use the policy from the image. This is a bug, because the original intent was to treat None and "prefer" identically, since "prefer" was just an explicit way to specify the default behaviour. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1687067 Title: problems with cpu and cpu-thread policy where flavor/image specify different settings Status in OpenStack Compute (nova): New Bug description: There are a number of issues related to CPU policy and CPU thread policy where the flavor extra-spec and image properties do not match up. The docs at https://docs.openstack.org/admin-guide/compute-cpu-topologies.html say the following: "Image metadata takes precedence over flavor extra specs. Thus, configuring competing policies causes an exception. By setting a shared policy through image metadata, administrators can prevent users configuring CPU policies in flavors and impacting resource utilization." For the CPU policy this is exactly backwards based on the code.
The flavor is specified by the admin, and so it generally takes priority over the image which is specified by the end user. If the flavor specifies "dedicated" then the result is dedicated regardless of what the image specifies. If the flavor specifies "shared" then the result depends on the image--if it specifies "dedicated" then we will raise an exception, otherwise we use "shared". If the flavor doesn't specify a CPU policy then the image can specify whatever policy it wants. The issue around CPU threading policy is more complicated. Back in Mitaka, if the flavor specified a CPU threading policy of either None or "prefer" then we would use the threading policy specified by the image (if it was set). If the flavor specified a CPU threading policy of "isolate" or "require" and the image specified a different CPU threading policy then we raised exception.ImageCPUThreadPolicyForbidden(), otherwise we used the CPU threading policy specified by the flavor. This behaviour is described in the spec at https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html In git commit 24997343 (which went into Newton) Nikola Dipanov made a code change that doesn't match the intent in the git commit message:

  if flavor_thread_policy in [None, fields.CPUThreadAllocationPolicy.PREFER]:
-     cpu_thread_policy = image_thread_policy
+     cpu_thread_policy = flavor_thread_policy or image_thread_policy

The effect of this is that if the flavor specifies a CPU threading policy of "prefer" then we will use a policy of "prefer" regardless of the policy from the image. If the flavor specifies a CPU threading policy of None then we will use the policy from the image. This is a bug, because the original intent was to treat None and "prefer" identically, since "prefer" was just an explicit way to specify the default behaviour. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1687067/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1669054] [NEW] RequestSpec.ignore_hosts from resize is reused in subsequent evacuate
Public bug reported: When doing a resize, if CONF.allow_resize_to_same_host is False, then we set RequestSpec.ignore_hosts and then save the RequestSpec. When we go to use the same RequestSpec on a subsequent rebuild/evacuate, ignore_hosts is still set from the previous resize. In RequestSpec.reset_forced_destinations() we reset force_hosts and force_nodes, it might make sense to also reset ignore_hosts. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1669054 Title: RequestSpec.ignore_hosts from resize is reused in subsequent evacuate Status in OpenStack Compute (nova): New Bug description: When doing a resize, if CONF.allow_resize_to_same_host is False, then we set RequestSpec.ignore_hosts and then save the RequestSpec. When we go to use the same RequestSpec on a subsequent rebuild/evacuate, ignore_hosts is still set from the previous resize. In RequestSpec.reset_forced_destinations() we reset force_hosts and force_nodes, it might make sense to also reset ignore_hosts. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1669054/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
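A minimal sketch of the suggested change (illustrative; the real method lives on nova.objects.RequestSpec):

    # In RequestSpec.reset_forced_destinations():
    def reset_forced_destinations(self):
        self.force_hosts = None
        self.force_nodes = None
        # Proposed: also drop the ignore_hosts left over from an
        # earlier resize so a later evacuate does not inherit it.
        self.ignore_hosts = None
        self.obj_reset_changes(
            ['force_hosts', 'force_nodes', 'ignore_hosts'])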
[Yahoo-eng-team] [Bug 1573288] Re: over time, horizon's admin -> overview page becomes very slow ....
*** This bug is a duplicate of bug 1508571 *** https://bugs.launchpad.net/bugs/1508571 ** This bug has been marked a duplicate of bug 1508571 Overview panels use too wide date range as default -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Dashboard (Horizon). https://bugs.launchpad.net/bugs/1573288 Title: over time, horizon's admin -> overview page becomes very slow Status in OpenStack Dashboard (Horizon): Incomplete Bug description: I've noticed that when logging into the admin account after a bunch of activity against the RDO installation, it takes a very long time (many minutes) before horizon loads (I think the issue is the overview admin page which is also the main landing page for logging in). The list includes overall activity including deleted projects. If you orchestrate lots of testing against the installation using "rally" you will see lots of projects get created and later deleted. As such I have an overview page which lists at the bottom: "Displaying 2035 items" Is it possible to do something about the Overview page either by displaying only the first 20 items, or changing the type of information being displayed? Logging into admin is very painful currently. Non-admin accounts log in quickly.

Version-Release number of selected component (if applicable): Liberty

How reproducible: Always.

Steps to Reproduce: Run rally against openstack in an endless loop. After a few days (or hours depending on what you do and how you do it) you will find horizon getting slower and slower.

Originally reported against RDO here: https://bugzilla.redhat.com/show_bug.cgi?id=1329414 though this is likely a general issue. To manage notifications about this bug go to: https://bugs.launchpad.net/horizon/+bug/1573288/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1654345] Re: realtime emulatorpin should use pcpus, not vcpus
Looks like this has already been dealt with on Master via bug 1614054, commit 6683bf9. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1654345 Title: realtime emulatorpin should use pcpus, not vcpus Status in OpenStack Compute (nova): Invalid Bug description: When specifying "hw:cpu_realtime_mask" in the flavor, LibvirtDriver._get_guest_numa_config() calls hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and "vcpus_em". It then directly uses "vcpus_em" to set the "emulatorpin" cpuset. The problem is that libvirt expects the "emulatorpin" cpuset to be specified as physical CPUs, not virtual CPUs. This results in unexpected values being used for the emulator pinning. The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the pCPUs to the "emulatorpin" cpuset. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1654345/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1654345] [NEW] realtime emulatorpin should use pcpus, not vcpus
Public bug reported: When specifying "hw:cpu_realtime_mask" in the flavor, LibvirtDriver._get_guest_numa_config() calls hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and "vcpus_em". It then directly uses "vcpus_em" to set the "emulatorpin" cpuset. The problem is that libvirt expects the "emulatorpin" cpuset to be specified as physical CPUs, not virtual CPUs. This results in unexpected values being used for the emulator pinning. The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the pCPUs to the "emulatorpin" cpuset. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Tags: compute libvirt newton-backport-potential

** Description changed:

  When specifying "hw:cpu_realtime_mask" in the flavor, LibvirtDriver._get_guest_numa_config() calls hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and "vcpus_em". It then directly uses "vcpus_em" to set the "emulatorpin" cpuset.

  The problem is that libvirt expects the "emulatorpin" cpuset to be specified as physical CPUs, not virtual CPUs.

+ This results in unexpected values being used for the emulator pinning.
+
  The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the pCPUs to the "emulatorpin" cpuset.

** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1654345 Title: realtime emulatorpin should use pcpus, not vcpus Status in OpenStack Compute (nova): New Bug description: When specifying "hw:cpu_realtime_mask" in the flavor, LibvirtDriver._get_guest_numa_config() calls hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and "vcpus_em". It then directly uses "vcpus_em" to set the "emulatorpin" cpuset. The problem is that libvirt expects the "emulatorpin" cpuset to be specified as physical CPUs, not virtual CPUs. This results in unexpected values being used for the emulator pinning. The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the pCPUs to the "emulatorpin" cpuset. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1654345/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
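The gist of the fix as a sketch (illustrative; 'cpu_pinning' stands in for the instance's vCPU-to-pCPU pinning map):

    def emulator_pcpus(vcpus_em, cpu_pinning):
        # vcpus_em holds guest vCPU numbers, but libvirt's <emulatorpin>
        # cpuset wants host pCPUs, so translate through the pinning map.
        return {cpu_pinning[vcpu] for vcpu in vcpus_em}

    # e.g. emulator_pcpus({0}, {0: 5, 1: 6}) == {5}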
[Yahoo-eng-team] [Bug 1638961] [NEW] evacuating an instance loses files specified via "--file" on the cli
Public bug reported: I booted up an instance as follows in my stable/mitaka devstack environment:

$ echo "this is a test" > /tmp/my_user_data.txt
$ echo "blah1" > /tmp/file1
$ echo "blah2" > /tmp/file2
$ nova boot --flavor m1.tiny --image cirros-0.3.4-x86_64-uec --config-drive true --user-data /tmp/my_user_data.txt --file /root/file1=/tmp/file1 --file /tmp/file2=/tmp/file2 testing

This booted up an instance, and within the guest I ran the following:

$ mkdir mnt
$ mount /dev/sr0 mnt
$ cat mnt/openstack/latest/user_data
this is a test
$ umount mnt
$ cat /root/file1
blah1
$ cat /tmp/file2
blah2

Then I killed the compute node and ran "nova evacuate testing". The evacuated instance had a config drive at /dev/sr0, but it did not have the /root/file1 or /tmp/file2 files. This is arguably incorrect. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1638961 Title: evacuating an instance loses files specified via "--file" on the cli Status in OpenStack Compute (nova): New Bug description: I booted up an instance as follows in my stable/mitaka devstack environment:

$ echo "this is a test" > /tmp/my_user_data.txt
$ echo "blah1" > /tmp/file1
$ echo "blah2" > /tmp/file2
$ nova boot --flavor m1.tiny --image cirros-0.3.4-x86_64-uec --config-drive true --user-data /tmp/my_user_data.txt --file /root/file1=/tmp/file1 --file /tmp/file2=/tmp/file2 testing

This booted up an instance, and within the guest I ran the following:

$ mkdir mnt
$ mount /dev/sr0 mnt
$ cat mnt/openstack/latest/user_data
this is a test
$ umount mnt
$ cat /root/file1
blah1
$ cat /tmp/file2
blah2

Then I killed the compute node and ran "nova evacuate testing". The evacuated instance had a config drive at /dev/sr0, but it did not have the /root/file1 or /tmp/file2 files. This is arguably incorrect. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1638961/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1613488] Re: changed fields of versionedobjects not tracked properly when down-versioning object
The review for the oslo.versionedobjects change is here: https://review.openstack.org/#/c/355981/ ** Changed in: nova Status: New => Fix Released ** Project changed: nova => oslo.versionedobjects -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1613488 Title: changed fields of versionedobjects not tracked properly when down- versioning object Status in oslo.versionedobjects: Fix Released Bug description: Sorry for the complicated write-up below, but the issue is complicated. I'm running into a problem between Mitaka and Kilo, but I *think* it'll also hit Mitaka/Liberty. The problem scenario is when we have older and newer services talking to each other. The problem occurs when nova-conductor writes to an object field that is removed in obj_make_compatible(). In particular, I'm hitting this with 'parent_addr' in the PciDevice class since it gets written in PciDevice._from_db_object(). In oslo_versionedobjects/base.py the remotable() function has the following line: self._changed_fields = set(updates.get('obj_what_changed', [])) This blindly sets the local self._changed_fields to be whatever the remote end sent as updates['obj_what_changed']. This is a problem because the far end can include fields that don't actually exist in the older object version. On the far end (which may be newer) in nova.conductor.manager.ConductorManager.object_action(), we will call the following (where 'objinst' is the current version of the object): updates['obj_what_changed'] = objinst.obj_what_changed() Since this is called against the newer object code, it can specify fields that do not exist in the older version of the object if nova- conductor has written those fields. The only workaround I've been able to come up with for this is to modify oslo_versionedobjects.base.remotable() to only include a field in self._changed_fields if it's in self.fields. This requires updating the older code prior to an upgrade, however. I think there's another related issue. In VersionedObject.obj_to_primitive() we set the changes in the primitive like this: if self.obj_what_changed(): obj[self._obj_primitive_key('changes')] = list( self.obj_what_changed()) Since we call self.obj_what_changed() on the newer version of the object, I think we will include changes to fields that were removed by obj_make_compatible_from_manifest(). It seems to me that in obj_to_primitive() we should not allow fields to be included in obj[self._obj_primitive_key('changes')] unless they're also listed in obj[self._obj_primitive_key('data')]. To manage notifications about this bug go to: https://bugs.launchpad.net/oslo.versionedobjects/+bug/1613488/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1613488] [NEW] changed fields of versionedobjects not tracked properly when down-versioning object
Public bug reported: Sorry for the complicated write-up below, but the issue is complicated. I'm running into a problem between Mitaka and Kilo, but I *think* it'll also hit Mitaka/Liberty. The problem scenario is when we have older and newer services talking to each other. The problem occurs when nova-conductor writes to an object field that is removed in obj_make_compatible(). In particular, I'm hitting this with 'parent_addr' in the PciDevice class since it gets written in PciDevice._from_db_object(). In oslo_versionedobjects/base.py the remotable() function has the following line:

    self._changed_fields = set(updates.get('obj_what_changed', []))

This blindly sets the local self._changed_fields to be whatever the remote end sent as updates['obj_what_changed']. This is a problem because the far end can include fields that don't actually exist in the older object version. On the far end (which may be newer) in nova.conductor.manager.ConductorManager.object_action(), we will call the following (where 'objinst' is the current version of the object):

    updates['obj_what_changed'] = objinst.obj_what_changed()

Since this is called against the newer object code, it can specify fields that do not exist in the older version of the object if nova-conductor has written those fields. The only workaround I've been able to come up with for this is to modify oslo_versionedobjects.base.remotable() to only include a field in self._changed_fields if it's in self.fields. This requires updating the older code prior to an upgrade, however. I think there's another related issue. In VersionedObject.obj_to_primitive() we set the changes in the primitive like this:

    if self.obj_what_changed():
        obj[self._obj_primitive_key('changes')] = list(
            self.obj_what_changed())

Since we call self.obj_what_changed() on the newer version of the object, I think we will include changes to fields that were removed by obj_make_compatible_from_manifest(). It seems to me that in obj_to_primitive() we should not allow fields to be included in obj[self._obj_primitive_key('changes')] unless they're also listed in obj[self._obj_primitive_key('data')]. ** Affects: nova Importance: Undecided Status: New ** Tags: compute oslo -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1613488 Title: changed fields of versionedobjects not tracked properly when down- versioning object Status in OpenStack Compute (nova): New Bug description: Sorry for the complicated write-up below, but the issue is complicated. I'm running into a problem between Mitaka and Kilo, but I *think* it'll also hit Mitaka/Liberty. The problem scenario is when we have older and newer services talking to each other. The problem occurs when nova-conductor writes to an object field that is removed in obj_make_compatible(). In particular, I'm hitting this with 'parent_addr' in the PciDevice class since it gets written in PciDevice._from_db_object(). In oslo_versionedobjects/base.py the remotable() function has the following line:

    self._changed_fields = set(updates.get('obj_what_changed', []))

This blindly sets the local self._changed_fields to be whatever the remote end sent as updates['obj_what_changed']. This is a problem because the far end can include fields that don't actually exist in the older object version.
On the far end (which may be newer) in nova.conductor.manager.ConductorManager.object_action(), we will call the following (where 'objinst' is the current version of the object): updates['obj_what_changed'] = objinst.obj_what_changed() Since this is called against the newer object code, it can specify fields that do not exist in the older version of the object if nova-conductor has written those fields. The only workaround I've been able to come up with for this is to modify oslo_versionedobjects.base.remotable() to only include a field in self._changed_fields if it's in self.fields. This requires updating the older code prior to an upgrade, however. I think there's another related issue. In VersionedObject.obj_to_primitive() we set the changes in the primitive like this: if self.obj_what_changed(): obj[self._obj_primitive_key('changes')] = list( self.obj_what_changed()) Since we call self.obj_what_changed() on the newer version of the object, I think we will include changes to fields that were removed by obj_make_compatible_from_manifest(). It seems to me that in obj_to_primitive() we should not allow fields to be included in obj[self._obj_primitive_key('changes')] unless they're also listed in obj[self._obj_primitive_key('data')]. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1613488/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
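To make the proposed obj_to_primitive() guard concrete, here is a minimal standalone Python sketch. It is a toy model, not the actual oslo.versionedobjects code; the dict keys only mirror the 'nova_object.*' primitive layout. Only fields that survived the down-version, i.e. that still appear in the 'data' dict, are advertised as changed.

def build_primitive(data, changed_fields):
    # data: field -> value after obj_make_compatible() has dropped
    # fields unknown to the target version.
    # changed_fields: obj_what_changed() from the *newer* object.
    primitive = {'nova_object.data': data}
    # Filter out fields (e.g. 'parent_addr') that the older version dropped.
    valid_changes = [f for f in changed_fields if f in data]
    if valid_changes:
        primitive['nova_object.changes'] = valid_changes
    return primitive

# 'parent_addr' was written by nova-conductor but removed when
# back-levelling PciDevice, so it must not be reported as changed:
print(build_primitive({'address': '0000:81:00.1'}, {'address', 'parent_addr'}))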
[Yahoo-eng-team] [Bug 1605720] [NEW] backing store missing for ephemeral disk on migration with boot-from-vol
Public bug reported: I'm on stable/mitaka, but the master code looks similar. I have compute nodes configured to use qcow2 and libvirt. The flavor has an ephemeral disk and a swap disk. I boot an instance with this flavor, and the instance is boot-from-volume. When I try to cold-migrate the instance, I get an error: 2016-07-21 23:33:48.561 46340 ERROR nova.compute.manager [instance: 4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb] libvirtError: Cannot access backing file '/etc/nova/instances/_base/ephemeral_1_0706d66' of storage file '/etc/nova/instances/4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb/disk.eph0' (as uid:0, gid:0): No such file or directory The problem seems to be that in nova.virt.libvirt.driver.LibvirtDriver.finish_migration() we call self._create_image(...block_device_info=None...) Down in _create_image() we handle the case of a "disk.local" ephemeral device, but that doesn't help because the device is actually named "disk.eph0". It looks like we then try to loop over any ephemerals in block_device_info, but that's None so we don't handle any of those (which is too bad since it looks like they would be named correctly). The end result is that we have a qcow2 "disk.eph0" image, but with potentially no backing store in /_base. When we tell libvirt to start the instance, this results in the above error. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1605720 Title: backing store missing for ephemeral disk on migration with boot-from- vol Status in OpenStack Compute (nova): New Bug description: I'm on stable/mitaka, but the master code looks similar. I have compute nodes configured to use qcow2 and libvirt. The flavor has an ephemeral disk and a swap disk. I boot an instance with this flavor, and the instance is boot-from-volume. When I try to cold-migrate the instance, I get an error: 2016-07-21 23:33:48.561 46340 ERROR nova.compute.manager [instance: 4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb] libvirtError: Cannot access backing file '/etc/nova/instances/_base/ephemeral_1_0706d66' of storage file '/etc/nova/instances/4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb/disk.eph0' (as uid:0, gid:0): No such file or directory The problem seems to be that in nova.virt.libvirt.driver.LibvirtDriver.finish_migration() we call self._create_image(...block_device_info=None...) Down in _create_image() we handle the case of a "disk.local" ephemeral device, but that doesn't help because the device is actually named "disk.eph0". It looks like we then try to loop over any ephemerals in block_device_info, but that's None so we don't handle any of those (which is too bad since it looks like they would be named correctly). The end result is that we have a qcow2 "disk.eph0" image, but with potentially no backing store in /_base. When we tell libvirt to start the instance, this results in the above error. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1605720/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
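As an illustration of why the backing file goes missing, here is a standalone sketch of the idea that the _base image for an ephemeral disk is keyed by size plus a short filesystem hash, and is only (re)created for devices found in block_device_info - which finish_migration() passed as None. The hash inputs used here are an assumption for illustration, not nova's exact recipe.

import hashlib

def ephemeral_backing_name(size_gb, fs_label='ephemeral0', os_type='default'):
    # nova derives a short hash from the filesystem setup; the inputs
    # below are illustrative stand-ins.
    suffix = hashlib.sha1(
        ('%s_%s' % (fs_label, os_type)).encode()).hexdigest()[:7]
    return 'ephemeral_%d_%s' % (size_gb, suffix)

# finish_migration() called _create_image(block_device_info=None), so the
# loop that would have (re)created this base image for 'disk.eph0' never
# ran, leaving the qcow2 with a dangling backing file reference:
print(ephemeral_backing_name(1))  # an 'ephemeral_1_<hash>' style name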
[Yahoo-eng-team] [Bug 1602814] [NEW] hyperthreading bug in NUMATopologyFilter
Public bug reported: I recently ran into an issue where I was trying to boot an instance with 8 vCPUs, with hw:cpu_policy=dedicated. The host had 8 pCPUs available, but they were a mix of siblings and non-siblings. In virt.hardware._pack_instance_onto_cores(), the _get_pinning() function seems to be the culprit. It was called with the following inputs: (Pdb) threads_no 1 (Pdb) sibling_set [CoercedSet([63]), CoercedSet([49]), CoercedSet([48]), CoercedSet([50]), CoercedSet([59, 15]), CoercedSet([18, 62])] (Pdb) instance_cell.cpuset CoercedSet([0, 1, 2, 3, 4, 5, 6, 7]) As we can see, we are looking for 8 vCPUs, and there are 8 pCPUs available. However, when we call _get_pinning() it doesn't give us a mapping: > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(899)_pack_instance_onto_cores() -> pinning = _get_pinning(threads_no, sibling_set, (Pdb) n > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(900)_pack_instance_onto_cores() -> instance_cell.cpuset) (Pdb) n > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(901)_pack_instance_onto_cores() -> if pinning: (Pdb) pinning This is a bug, if we haven't specified anything regarding hyperthreading then we should be able to run with a mix of siblings and non-siblings. ** Affects: nova Importance: Undecided Status: New ** Tags: compute numa scheduler -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1602814 Title: hyperthreading bug in NUMATopologyFilter Status in OpenStack Compute (nova): New Bug description: I recently ran into an issue where I was trying to boot an instance with 8 vCPUs, with hw:cpu_policy=dedicated. The host had 8 pCPUs available, but they were a mix of siblings and non-siblings. In virt.hardware._pack_instance_onto_cores(), the _get_pinning() function seems to be the culprit. It was called with the following inputs: (Pdb) threads_no 1 (Pdb) sibling_set [CoercedSet([63]), CoercedSet([49]), CoercedSet([48]), CoercedSet([50]), CoercedSet([59, 15]), CoercedSet([18, 62])] (Pdb) instance_cell.cpuset CoercedSet([0, 1, 2, 3, 4, 5, 6, 7]) As we can see, we are looking for 8 vCPUs, and there are 8 pCPUs available. However, when we call _get_pinning() it doesn't give us a mapping: > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(899)_pack_instance_onto_cores() -> pinning = _get_pinning(threads_no, sibling_set, (Pdb) n > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(900)_pack_instance_onto_cores() -> instance_cell.cpuset) (Pdb) n > /usr/lib/python2.7/site-packages/nova/virt/hardware.py(901)_pack_instance_onto_cores() -> if pinning: (Pdb) pinning This is a bug, if we haven't specified anything regarding hyperthreading then we should be able to run with a mix of siblings and non-siblings. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1602814/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
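The reporter's point can be shown with a standalone sketch. This is not nova's _get_pinning(); it is a deliberately naive packer demonstrating that, when no thread policy is requested, the 8 vCPUs do fit on these 8 pCPUs even though the sibling sets mix single cores and hyperthread pairs.

def naive_pinning(sibling_set, instance_cpus):
    # Flatten the sibling groups and pin greedily, ignoring threading.
    free = [cpu for sib in sibling_set for cpu in sib]
    if len(free) < len(instance_cpus):
        return None
    return dict(zip(sorted(instance_cpus), free))

# The values captured in the pdb session above: 4 single cores plus
# 2 hyperthread pairs = 8 usable pCPUs.
sibling_set = [{63}, {49}, {48}, {50}, {59, 15}, {18, 62}]
print(naive_pinning(sibling_set, {0, 1, 2, 3, 4, 5, 6, 7}))  # a full mapping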
[Yahoo-eng-team] [Bug 1600304] [NEW] _update_usage_from_migrations() can end up processing stale migrations
Public bug reported: I recently found a bug in Mitaka, and it appears to be still present in master. I was testing a separate patch by doing resizes, and bugs in my code had resulted in a number of incomplete resizes involving compute-1. I then did a resize from compute-0 to compute-0, and saw compute-1's resource usage go up when it ran the resource audit. This got me curious, so I went digging and discovered a gap in the current resource audit logic. The problem arises if: 1) You have one or more stale migrations which didn't complete properly that involve the current compute node. 2) The instance from the uncompleted migration is currently doing a resize/migration that does not involve the current compute node. When this happens, _update_usage_from_migrations() will be passed in the stale migration, and since the instance is in fact in a resize state, the current compute node will erroneously account for the instance. (Even though the instance isn't doing anything involving the current compute node.) The fix is to check that the instance migration ID matches the ID of the migration being analyzed. This will work because in the case of the stale migration we will have hit the error case in _pair_instances_to_migrations(), and so the instance will be lazy-loaded from the DB, ensuring that its migration ID is up-to-date. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: In Progress ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1600304 Title: _update_usage_from_migrations() can end up processing stale migrations Status in OpenStack Compute (nova): In Progress Bug description: I recently found a bug in Mitaka, and it appears to be still present in master. I was testing a separate patch by doing resizes, and bugs in my code had resulted in a number of incomplete resizes involving compute-1. I then did a resize from compute-0 to compute-0, and saw compute-1's resource usage go up when it ran the resource audit. This got me curious, so I went digging and discovered a gap in the current resource audit logic. The problem arises if: 1) You have one or more stale migrations which didn't complete properly that involve the current compute node. 2) The instance from the uncompleted migration is currently doing a resize/migration that does not involve the current compute node. When this happens, _update_usage_from_migrations() will be passed in the stale migration, and since the instance is in fact in a resize state, the current compute node will erroneously account for the instance. (Even though the instance isn't doing anything involving the current compute node.) The fix is to check that the instance migration ID matches the ID of the migration being analyzed. This will work because in the case of the stale migration we will have hit the error case in _pair_instances_to_migrations(), and so the instance will be lazy-loaded from the DB, ensuring that its migration ID is up-to-date. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1600304/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
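A standalone sketch of the proposed guard; the names and structures are simplified stand-ins for what _update_usage_from_migrations() sees. A migration is skipped when the instance's current migration id no longer matches it.

def live_migrations(migrations, instances_by_uuid):
    for migration in migrations:
        instance = instances_by_uuid.get(migration['instance_uuid'])
        if instance is None:
            continue
        # A mismatch means 'migration' is a leftover from an earlier,
        # incomplete resize and must not be charged to this node.
        if instance['migration_id'] != migration['id']:
            continue
        yield migration

migs = [{'id': 5, 'instance_uuid': 'u1'},   # stale resize involving this node
        {'id': 9, 'instance_uuid': 'u1'}]   # the in-progress migration
insts = {'u1': {'migration_id': 9}}
print([m['id'] for m in live_migrations(migs, insts)])  # [9]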
[Yahoo-eng-team] [Bug 1213224] Re: nova allows multiple aggregates with same zone name
Just to clarify something, availability zones don't "have" host aggregates. Rather, some host aggregates *are also* availability zones, but a given host can only be in one availability zone. I went and looked at the code, and the way it is currently written I think it is actually okay to have multiple host aggregates specifying the same availability zone. The logic in the AvailabilityZoneFilter is basically to loop over all host aggregates for the host in question, and if one of them has an availability zone (there should be only one) then the filter will check it against the availability zone specified by the user. As such, I'm going to close this bug. ** Changed in: nova Status: Confirmed => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1213224 Title: nova allows multiple aggregates with same zone name Status in OpenStack Compute (nova): Invalid Bug description: Currently (on grizzly), nova will let you specify multiple aggregates with the same zone name. This seems like a mismatch since the end-user can only specify an availability zone when creating an instance, and there could be multiple aggregates (with different hosts) mapping to that zone. On aggregate creation, nova should ensure that the availability zone name (if specified) is not a duplicate of an existing availability zone name. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1213224/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
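For reference, a simplified standalone model of the AvailabilityZoneFilter logic described above (not nova's exact code): the host passes if any of its aggregates carries the requested availability zone, falling back to the default zone when none does.

def host_passes(host_aggregates, requested_az, default_az='nova'):
    azs = {agg['metadata'].get('availability_zone')
           for agg in host_aggregates}
    azs.discard(None)
    # Duplicate AZ names across aggregates are harmless here: the host
    # passes as long as one of its aggregates names the requested AZ.
    return requested_az in (azs or {default_az})

aggs = [{'metadata': {'availability_zone': 'az1'}}, {'metadata': {}}]
print(host_passes(aggs, 'az1'))  # True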
[Yahoo-eng-team] [Bug 1590607] [NEW] incorrect handling of host numa cell usage with instances having no numa topology
Public bug reported: I think there is a problem in host NUMA node resource tracking when there is an instance with no numa topology on the same node as instances with numa topology. It's triggered while running the resource audit, which ultimately calls hardware.get_host_numa_usage_from_instance() and assigns the result to self.compute_node.numa_topology. The problem occurs if you have a number of instances with numa topology, and then an instance with no numa topology. When running numa_usage_from_instances() for the instance with no numa topology we cache the values of "memory_usage" and "cpu_usage". However, because instance.cells is empty we don't enter the loop. Since the two lines in this commit are indented too far they don't get called, and we end up appending a host cell with "cpu_usage" and "memory_usage" of zero. This results in a host numa_topology cell with incorrect "cpu_usage" and "memory_usage" values, though I think the overall host cpu/memory usage is still correct. The fix is to reduce the indentation of the two lines in question so that they get called even when the instance has no numa topology. This writes the original host cell usage information back to it. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Tags: compute scheduler ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590607 Title: incorrect handling of host numa cell usage with instances having no numa topology Status in OpenStack Compute (nova): New Bug description: I think there is a problem in host NUMA node resource tracking when there is an instance with no numa topology on the same node as instances with numa topology. It's triggered while running the resource audit, which ultimately calls hardware.get_host_numa_usage_from_instance() and assigns the result to self.compute_node.numa_topology. The problem occurs if you have a number of instances with numa topology, and then an instance with no numa topology. When running numa_usage_from_instances() for the instance with no numa topology we cache the values of "memory_usage" and "cpu_usage". However, because instance.cells is empty we don't enter the loop. Since the two lines in this commit are indented too far they don't get called, and we end up appending a host cell with "cpu_usage" and "memory_usage" of zero. This results in a host numa_topology cell with incorrect "cpu_usage" and "memory_usage" values, though I think the overall host cpu/memory usage is still correct. The fix is to reduce the indentation of the two lines in question so that they get called even when the instance has no numa topology. This writes the original host cell usage information back to it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590607/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
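The indentation bug is easiest to see in a standalone sketch, a toy version of numa_usage_from_instances() with simplified structures: with an empty instance cell list the loop body never runs, so the usage write-back must sit outside the loop or the new host cell records zero usage.

def cell_usage(host_cell, instance_cells):
    memory_usage = host_cell['memory_usage']
    cpu_usage = host_cell['cpu_usage']
    for icell in instance_cells:          # empty for a no-NUMA instance
        memory_usage += icell['memory']
        cpu_usage += len(icell['cpuset'])
    # The fix: these belong at *this* indentation level; indented into
    # the loop they are skipped and the usage silently resets to zero.
    return dict(host_cell, memory_usage=memory_usage, cpu_usage=cpu_usage)

host = {'memory_usage': 1024, 'cpu_usage': 2}
print(cell_usage(host, []))  # existing usage preserved, not zeroed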
[Yahoo-eng-team] [Bug 1590133] [NEW] help text for cpu_allocation_ratio is wrong
Public bug reported: In stable/mitaka in resource_tracker.py the help text for the cpu_allocation_ratio config option reads in part: 'NOTE: This can be set per-compute, or if set to 0.0, the value ' 'set on the scheduler node(s) will be used ' 'and defaulted to 16.0'), However, there is no longer any value set on the scheduler node(s). They use the per-compute-node value set in resource_tracker.py. Instead, if the value is 0.0 then ComputeNode._from_db_object() will convert the value to 16.0. This ensures that the scheduler filters see a value of 16.0 by default. In Newton the plan appears to be to change the default value to an explicit 16.0 (and presumably updating the help text) but that doesn't help the already-released Mitaka code. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590133 Title: help text for cpu_allocation_ratio is wrong Status in OpenStack Compute (nova): New Bug description: In stable/mitaka in resource_tracker.py the help text for the cpu_allocation_ratio config option reads in part: 'NOTE: This can be set per-compute, or if set to 0.0, the value ' 'set on the scheduler node(s) will be used ' 'and defaulted to 16.0'), However, there is no longer any value set on the scheduler node(s). They use the per-compute-node value set in resource_tracker.py. Instead, if the value is 0.0 then ComputeNode._from_db_object() will convert the value to 16.0. This ensures that the scheduler filters see a value of 16.0 by default. In Newton the plan appears to be to change the default value to an explicit 16.0 (and presumably updating the help text) but that doesn't help the already-released Mitaka code. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590133/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
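A one-function sketch of the behaviour the help text should actually describe, simplified from the ComputeNode._from_db_object() handling referenced above:

def effective_cpu_allocation_ratio(configured):
    # A configured 0.0 is replaced on the compute node itself; the
    # scheduler no longer supplies its own value.
    return 16.0 if configured == 0.0 else configured

print(effective_cpu_allocation_ratio(0.0))  # 16.0
print(effective_cpu_allocation_ratio(4.0))  # 4.0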
[Yahoo-eng-team] [Bug 1590091] [NEW] bug in handling of ISOLATE thread policy
Public bug reported: I'm running stable/mitaka in devstack. I've got a small system with 2 pCPUs, both marked as available for pinning. They're two cores of a single processor, no threads. "virsh capabilities" shows: [XML output elided - the markup was stripped in transit] It is my understanding that I should be able to boot up an instance with two dedicated CPUs and a thread policy of ISOLATE, since I have two physical cores and no threads. (Is this correct?) Unfortunately, the NUMATopology filter fails my host. The problem is in _pack_instance_onto_cores(): if (instance_cell.cpu_thread_policy == fields.CPUThreadAllocationPolicy.ISOLATE): # make sure we have at least one fully free core if threads_per_core not in sibling_sets: return pinning = _get_pinning(1, # we only want to "use" one thread per core sibling_sets[threads_per_core], instance_cell.cpuset) Right before the call to _get_pinning() we have the following: (Pdb) instance_cell.cpu_thread_policy u'isolate' (Pdb) threads_per_core 1 (Pdb) sibling_sets defaultdict(<type 'list'>, {1: [CoercedSet([0, 1])], 2: [CoercedSet([0, 1])]}) (Pdb) sibling_sets[threads_per_core] [CoercedSet([0, 1])] (Pdb) instance_cell.cpuset CoercedSet([0, 1]) In this code snippet, _get_pinning() returns None, causing the filter to fail the host. Tracing a bit further in, in _get_pinning() we have the following line: if threads_no * len(sibling_set) < len(instance_cores): return Coming into this line of code the variables look like this: (Pdb) threads_no 1 (Pdb) sibling_set [CoercedSet([0, 1])] (Pdb) len(sibling_set) 1 (Pdb) instance_cores CoercedSet([0, 1]) (Pdb) len(instance_cores) 2 So the test evaluates to True, and we bail out. I don't think this is correct, we should be able to schedule on this host. ** Affects: nova Importance: Undecided Status: New ** Tags: compute scheduler -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590091 Title: bug in handling of ISOLATE thread policy Status in OpenStack Compute (nova): New Bug description: I'm running stable/mitaka in devstack. I've got a small system with 2 pCPUs, both marked as available for pinning. They're two cores of a single processor, no threads. "virsh capabilities" shows: [XML output elided - the markup was stripped in transit] It is my understanding that I should be able to boot up an instance with two dedicated CPUs and a thread policy of ISOLATE, since I have two physical cores and no threads. (Is this correct?) Unfortunately, the NUMATopology filter fails my host. The problem is in _pack_instance_onto_cores(): if (instance_cell.cpu_thread_policy == fields.CPUThreadAllocationPolicy.ISOLATE): # make sure we have at least one fully free core if threads_per_core not in sibling_sets: return pinning = _get_pinning(1, # we only want to "use" one thread per core sibling_sets[threads_per_core], instance_cell.cpuset) Right before the call to _get_pinning() we have the following: (Pdb) instance_cell.cpu_thread_policy u'isolate' (Pdb) threads_per_core 1 (Pdb) sibling_sets defaultdict(<type 'list'>, {1: [CoercedSet([0, 1])], 2: [CoercedSet([0, 1])]}) (Pdb) sibling_sets[threads_per_core] [CoercedSet([0, 1])] (Pdb) instance_cell.cpuset CoercedSet([0, 1]) In this code snippet, _get_pinning() returns None, causing the filter to fail the host.
Tracing a bit further in, in _get_pinning() we have the following line: if threads_no * len(sibling_set) < len(instance_cores): return Coming into this line of code the variables look like this: (Pdb) threads_no 1 (Pdb) sibling_set [CoercedSet([0, 1])] (Pdb) len(sibling_set) 1 (Pdb) instance_cores CoercedSet([0, 1]) (Pdb) len(instance_cores) 2 So the test evaluates to True, and we bail out. I don't think this is correct, we should be able to schedule on this host. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590091/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
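Re-running the failing comparison with the captured values makes the problem visible. This is a standalone snippet, not nova's code, and the second check is only one plausible correction (counting CPUs inside the sibling sets rather than the number of sets):

threads_no = 1
sibling_set = [{0, 1}]        # one sibling group holding both free cores
instance_cores = {0, 1}

# The current check: 1 * 1 < 2, so _get_pinning() bails out even though
# two fully free cores exist.
print(threads_no * len(sibling_set) < len(instance_cores))   # True (bug)

# Counting the individual CPUs inside the sibling sets instead: 1 * 2 < 2
# is False, so the host would not be rejected.
print(threads_no * sum(len(s) for s in sibling_set) < len(instance_cores))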
[Yahoo-eng-team] [Bug 1577642] [NEW] race between disk_available_least and instance operations
Public bug reported: The calculation for LibvirtDriver._get_disk_over_committed_size_total() loops over all the instances on the hypervisor to try to figure out the total overcommitted size for all instances. However, at the time that routine is called from ResourceTracker.update_available_resource() we do not hold COMPUTE_RESOURCE_SEMAPHORE. This means that instance claims can be modified (due to instance creation/deletion/resize/migration/etc), potentially causing the calculated value for data['disk_available_least'] to not actually reflect current reality, and potentially allowing different eventlets to have different views of data['disk_available_least']. There was a related bug reported some time back (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with the underlying race condition they just sort of papered over it by ignoring the InstanceNotFound exception. ** Affects: nova Importance: Undecided Status: New ** Tags: compute race-condition ** Description changed: The calculation for LibvirtDriver._get_disk_over_committed_size_total() loops over all the instances on the hypervisor to try to figure out the total overcommitted size for all instances. However, at the time that routine is called from ResourceTracker.update_available_resource() we do not hold COMPUTE_RESOURCE_SEMAPHORE. This means that instances can be created/destroyed/resized, causing the calculated value for data['disk_available_least'] to not actually reflect current reality. + + There was a related bug reported some time back + (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with + the underlying race condition they just sort of papered over it by + ignoring the InstanceNotFound exception. ** Description changed: The calculation for LibvirtDriver._get_disk_over_committed_size_total() loops over all the instances on the hypervisor to try to figure out the total overcommitted size for all instances. However, at the time that routine is called from ResourceTracker.update_available_resource() we do not hold - COMPUTE_RESOURCE_SEMAPHORE. This means that instances can be - created/destroyed/resized, causing the calculated value for - data['disk_available_least'] to not actually reflect current reality. + COMPUTE_RESOURCE_SEMAPHORE. This means that instance claims can be + modified (due to instance creation/deletion/resize/migration/etc), + causing the calculated value for data['disk_available_least'] to not + actually reflect current reality. There was a related bug reported some time back (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with the underlying race condition they just sort of papered over it by ignoring the InstanceNotFound exception. ** Description changed: The calculation for LibvirtDriver._get_disk_over_committed_size_total() loops over all the instances on the hypervisor to try to figure out the total overcommitted size for all instances. However, at the time that routine is called from ResourceTracker.update_available_resource() we do not hold COMPUTE_RESOURCE_SEMAPHORE. This means that instance claims can be modified (due to instance creation/deletion/resize/migration/etc), - causing the calculated value for data['disk_available_least'] to not - actually reflect current reality. + potentially causing the calculated value for + data['disk_available_least'] to not actually reflect current reality, + and potentially allowing different eventlets to have different views of + data['disk_available_least']. 
There was a related bug reported some time back (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with the underlying race condition they just sort of papered over it by ignoring the InstanceNotFound exception. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1577642 Title: race between disk_available_least and instance operations Status in OpenStack Compute (nova): New Bug description: The calculation for LibvirtDriver._get_disk_over_committed_size_total() loops over all the instances on the hypervisor to try to figure out the total overcommitted size for all instances. However, at the time that routine is called from ResourceTracker.update_available_resource() we do not hold COMPUTE_RESOURCE_SEMAPHORE. This means that instance claims can be modified (due to instance creation/deletion/resize/migration/etc), potentially causing the calculated value for data['disk_available_least'] to not actually reflect current reality, and potentially allowing different eventlets to have different views of data['disk_available_least']. There was a related bug reported some time back (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with the underlying race condition they just sort of
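An illustrative sketch of the kind of serialization being asked for. This mirrors nova's semaphore pattern in shape only; the lock name and the body are simplified assumptions. Holding the same lock that guards claim updates while summing per-instance over-commit keeps disk_available_least internally consistent.

import threading

COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

def disk_over_committed_total(instances):
    # With the lock held, claims cannot be added or removed mid-sum, so
    # every caller computes the figure from one consistent snapshot.
    with COMPUTE_RESOURCE_SEMAPHORE:
        return sum(i['virtual_size'] - i['actual_size'] for i in instances)

print(disk_over_committed_total([{'virtual_size': 40, 'actual_size': 10}]))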
[Yahoo-eng-team] [Bug 1552777] [NEW] resizing from flavor with swap to one without swap puts instance into Error status
Public bug reported: In a single-node devstack (current trunk, nova commit 6e1051b7), if you boot an instance with a flavor that has nonzero swap and then resize to a flavor with zero swap it causes an exception. It seems that we somehow neglect to remove the swap file from the instance. 2016-03-03 10:02:41.415 ERROR nova.virt.libvirt.guest [req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] Error launching a defined domain with XML: [libvirt domain XML elided - the markup was stripped in transit; it described instance-0001 (54711b56-fa72-4eac-a5d3-aa29ed128098) with 524288 KiB of memory, 1 vCPU, kernel/ramdisk under /opt/stack/data/nova/instances/54711b56-fa72-4eac-a5d3-aa29ed128098/, cmdline root=/dev/vda console=tty0 console=ttyS0, and emulator /usr/bin/kvm-spice] 2016-03-03 10:02:41.417 ERROR nova.compute.manager [req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] Setting instance vm_state to ERROR 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] Traceback (most recent call last): 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/compute/manager.py", line 3999, in finish_resize 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] disk_info, image_meta) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/compute/manager.py", line 3964, in _finish_resize 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] old_instance_type) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__ 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] six.reraise(self.type_, self.value, self.tb) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/compute/manager.py", line 3959, in _finish_resize 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] block_device_info, power_on) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7202, in finish_migration 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] vifs_already_plugged=True) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4862, in _create_domain_and_network 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] xml, pause=pause, power_on=power_on) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4793, in _create_domain 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] guest.launch(pause=pause) 2016-03-03 10:02:41.417 TRACE nova.compute.manager
[instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 142, in launch 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] self._encoded_xml, errors='ignore') 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__ 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] six.reraise(self.type_, self.value, self.tb) 2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 54711b56-fa72-4eac-a5d3-aa29ed128098] File
[Yahoo-eng-team] [Bug 1549032] [NEW] max_net_count doesn't interact properly with min_count when booting multiple instances
Public bug reported: In compute.api.API._create_instance() we have a min_count that is optionally passed in by the end user as part of the boot request. We calculate max_net_count based on networking constraints. Currently we error out if max_net_count is zero, but we don't check it against min_count. If the end user specifies a min_count that is greater than the calculated max_net_count the resulting error isn't very useful. We know that min_count is guaranteed to be at least 1, so we can replace the existing test against zero with one against min_count. Doing this gives a much more reasonable error message: controller-0:~$ nova boot --image myimage --flavor simple --min-count 2 --max-count 3 test ERROR (Forbidden): Maximum number of ports exceeded (HTTP 403) (Request-ID: req-f7ff28bf-5708-4cbf-a634-2e9686afd970) ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1549032 Title: max_net_count doesn't interact properly with min_count when booting multiple instances Status in OpenStack Compute (nova): In Progress Bug description: In compute.api.API._create_instance() we have a min_count that is optionally passed in by the end user as part of the boot request. We calculate max_net_count based on networking constraints. Currently we error out if max_net_count is zero, but we don't check it against min_count. If the end user specifies a min_count that is greater than the calculated max_net_count the resulting error isn't very useful. We know that min_count is guaranteed to be at least 1, so we can replace the existing test against zero with one against min_count. Doing this gives a much more reasonable error message: controller-0:~$ nova boot --image myimage --flavor simple --min-count 2 --max-count 3 test ERROR (Forbidden): Maximum number of ports exceeded (HTTP 403) (Request-ID: req-f7ff28bf-5708-4cbf-a634-2e9686afd970) To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1549032/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
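The proposed change amounts to one comparison, sketched standalone below (ValueError stands in for the PortLimitExceeded-style exception nova would raise): since min_count is guaranteed to be at least 1, testing against it subsumes the old test against zero.

def check_port_quota(min_count, max_net_count):
    if max_net_count < min_count:
        raise ValueError('Maximum number of ports exceeded')

try:
    check_port_quota(min_count=2, max_net_count=1)
except ValueError as exc:
    print(exc)  # the user asked for 2 instances, networking allows 1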
[Yahoo-eng-team] [Bug 1542039] [NEW] nova should not reschedule an instance that has already been deleted
Public bug reported: I'm investigating an issue where an instance with a large disk and an attached cinder volume was booted in a stable/kilo OpenStack setup with the diskFilter disabled. The timeline looks like this: scheduler picks initial compute node nova attempts to boot it up on one compute node, it runs out of disk space and gets rescheduled scheduler picks another compute node user requests instance deletion user requests cinder volume deletion nova attempts to boot it up on second compute node, it runs out of disk space and gets rescheduled scheduler picks a third compute node nova attempts to boot it up on third compute node, runs into problems due to missing cinder volume The issue I want to address in this bug is whether it makes sense to reschedule the instance when the instance has already been deleted. Also, instance deletion sets the task_state to 'deleting' early on. In compute.manager.ComputeManager._do_build_and_run_instance(), if we decide to reschedule then nova-compute will set the task_state to 'scheduling' and then save the instance, which I think could overwrite the 'deleting' state in the DB. So...would it make sense to have nova-compute put an "expected_task_state" on the instance.save() call that sets the 'scheduling' task_state? ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1542039 Title: nova should not reschedule an instance that has already been deleted Status in OpenStack Compute (nova): New Bug description: I'm investigating an issue where an instance with a large disk and an attached cinder volume was booted in a stable/kilo OpenStack setup with the diskFilter disabled. The timeline looks like this: scheduler picks initial compute node nova attempts to boot it up on one compute node, it runs out of disk space and gets rescheduled scheduler picks another compute node user requests instance deletion user requests cinder volume deletion nova attempts to boot it up on second compute node, it runs out of disk space and gets rescheduled scheduler picks a third compute node nova attempts to boot it up on third compute node, runs into problems due to missing cinder volume The issue I want to address in this bug is whether it makes sense to reschedule the instance when the instance has already been deleted. Also, instance deletion sets the task_state to 'deleting' early on. In compute.manager.ComputeManager._do_build_and_run_instance(), if we decide to reschedule then nova-compute will set the task_state to 'scheduling' and then save the instance, which I think could overwrite the 'deleting' state in the DB. So...would it make sense to have nova-compute put an "expected_task_state" on the instance.save() call that sets the 'scheduling' task_state? To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1542039/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
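A toy model of the suggested expected_task_state guard; this imitates the shape of nova's Instance.save(), and the specific expected states listed are assumptions. The point is that the reschedule loses the race cleanly instead of clobbering 'deleting'.

class UnexpectedTaskStateError(Exception):
    pass

class Instance(object):
    def __init__(self):
        self.task_state = None            # value currently in the DB

    def save(self, task_state, expected_task_state):
        if self.task_state not in expected_task_state:
            # A concurrent delete already set 'deleting'; refuse to
            # overwrite it with 'scheduling'.
            raise UnexpectedTaskStateError(self.task_state)
        self.task_state = task_state

inst = Instance()
inst.task_state = 'deleting'              # the delete won the race
try:
    inst.save('scheduling', expected_task_state=[None, 'spawning'])
except UnexpectedTaskStateError:
    print('reschedule aborted; instance is being deleted')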
[Yahoo-eng-team] [Bug 1538619] [NEW] Fix up argument order in remove_volume_connection()
Public bug reported: The RPC API function for remove_volume_connection() uses a different argument order than the ComputeManager function of the same name. The normal RPC code uses named arguments, but the _ComputeV4Proxy version doesn't, and it has the order wrong. This causes problems when called by _rollback_live_migration(). The fix seems to be trivial: diff --git a/nova/compute/manager.py b/nova/compute/manager.py index d6efd18..65c1b75 100644 --- a/nova/compute/manager.py +++ b/nova/compute/manager.py @@ -6870,7 +6870,8 @@ class _ComputeV4Proxy(object): instance) def remove_volume_connection(self, ctxt, instance, volume_id): -return self.manager.remove_volume_connection(ctxt, instance, volume_id) +# The RPC API uses different argument order than the local API. +return self.manager.remove_volume_connection(ctxt, volume_id, instance) def rescue_instance(self, ctxt, instance, rescue_password, rescue_image_ref, clean_shutdown): Given that this only applies to stable/kilo I'm guessing there's no point in trying to push a patch, but I thought I'd include this here in case anyone else runs into it. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1538619 Title: Fix up argument order in remove_volume_connection() Status in OpenStack Compute (nova): New Bug description: The RPC API function for remove_volume_connection() uses a different argument order than the ComputeManager function of the same name. The normal RPC code uses named arguments, but the _ComputeV4Proxy version doesn't, and it has the order wrong. This causes problems when called by _rollback_live_migration(). The fix seems to be trivial: diff --git a/nova/compute/manager.py b/nova/compute/manager.py index d6efd18..65c1b75 100644 --- a/nova/compute/manager.py +++ b/nova/compute/manager.py @@ -6870,7 +6870,8 @@ class _ComputeV4Proxy(object): instance) def remove_volume_connection(self, ctxt, instance, volume_id): -return self.manager.remove_volume_connection(ctxt, instance, volume_id) +# The RPC API uses different argument order than the local API. +return self.manager.remove_volume_connection(ctxt, volume_id, instance) def rescue_instance(self, ctxt, instance, rescue_password, rescue_image_ref, clean_shutdown): Given that this only applies to stable/kilo I'm guessing there's no point in trying to push a patch, but I thought I'd include this here in case anyone else runs into it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1538619/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1536703] [NEW] unable to re-issue confirm/revert of resize
Public bug reported: If we call confirm_resize() that sets migration.status to 'confirming' and sends an RPC cast to the compute node. If there's a glitch and that cast is received but never processed, there's no way to confirm the resize since it only looks for migrations with a status of "finished". It looks like it should be safe as-is to allow calling confirm_resize on a migration in the "confirming" state since it's already synchronized on the instance. A similar problem holds for an interrupted revert_resize(), but in that case there's no synchronization currently. Not sure if that's a problem or not. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1536703 Title: unable to re-issue confirm/revert of resize Status in OpenStack Compute (nova): New Bug description: If we call confirm_resize() that sets migration.status to 'confirming' and sends an RPC cast to the compute node. If there's a glitch and that cast is received but never processed, there's no way to confirm the resize since it only looks for migrations with a status of "finished". It looks like it should be safe as-is to allow calling confirm_resize on a migration in the "confirming" state since it's already synchronized on the instance. A similar problem holds for an interrupted revert_resize(), but in that case there's no synchronization currently. Not sure if that's a problem or not. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1536703/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
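The relaxation suggested above is small; here is a standalone sketch (nova actually looks migrations up by instance and status in the DB layer, so this only models the idea): accept a migration already stuck in 'confirming' so a lost RPC cast can be re-issued.

def confirmable_migration(migrations):
    for migration in migrations:
        # 'confirming' is accepted as well as 'finished' so that a cast
        # that was received but never processed can simply be retried.
        if migration['status'] in ('finished', 'confirming'):
            return migration
    return None

print(confirmable_migration([{'id': 7, 'status': 'confirming'}]))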
[Yahoo-eng-team] [Bug 1528325] [NEW] instance with explicit "small" pages treated different from implicit
Public bug reported: In numa_get_constraints() we call pagesize = _numa_get_pagesize_constraints(flavor, image_meta) then later we have if nodes or pagesize: [setattr(c, 'pagesize', pagesize) for c in numa_topology.cells] This ends up treating an instance which doesn't specify pagesize (which results in 4K pages) differently from an instance that explicitly specifies 4K pages. In the first case the instance may not have a numa topology specified, while in the second case it does. In _get_guest_numa_config() we check whether the guest has a numa topology, and if it does we restrict it to a single NUMA node rather than letting it float across the whole host. This unexpectedly results in different CPU and memory affinity depending on whether an instance implicitly assumes 4K pages or explicitly specifies them. ** Affects: nova Importance: Undecided Status: New ** Tags: compute numa ** Summary changed: - explicit "small" pages treated different from implicit + instance with explicit "small" pages treated different from implicit -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1528325 Title: instance with explicit "small" pages treated different from implicit Status in OpenStack Compute (nova): New Bug description: In numa_get_constraints() we call pagesize = _numa_get_pagesize_constraints(flavor, image_meta) then later we have if nodes or pagesize: [setattr(c, 'pagesize', pagesize) for c in numa_topology.cells] This ends up treating an instance which doesn't specify pagesize (which results in 4K pages) differently from an instance that explicitly specifies 4K pages. In the first case the instance may not have a numa topology specified, while in the second case it does. In _get_guest_numa_config() we check whether the guest has a numa topology, and if it does we restrict it to a single NUMA node rather than letting it float across the whole host. This unexpectedly results in different CPU and memory affinity depending on whether an instance implicitly assumes 4K pages or explicitly specifies them. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1528325/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
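The asymmetry reduces to a single truth test, shown standalone here (mirroring the 'if nodes or pagesize:' condition quoted above): an explicit 4K pagesize creates a NUMA topology, and with it single-node affinity, while the implicit default does not.

def gets_numa_topology(nodes, pagesize):
    # mirrors 'if nodes or pagesize:' from numa_get_constraints()
    return bool(nodes or pagesize)

print(gets_numa_topology(None, None))  # implicit 4K pages: floats freely
print(gets_numa_topology(None, 4))     # explicit 4K pages: pinned to a node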
[Yahoo-eng-team] [Bug 1512907] [NEW] leak of vswitch port if delete an instance while resizing
Public bug reported: I've been testing with a modified version of stable/kilo, but I believe the bug is present in upstream stable/kilo. When using nova with neutron, if I boot an instance, then trigger a resize, and then delete the instance at just the right point during the resize it ends up causing a vswitch port to be "leaked". So far I've been able to show that if I issue the "nova delete" command while the resize operation is anywhere in nova.compute.manager.ComputeManager._finish_resize() up to the point where we set "instance.vm_state = vm_states.RESIZED" then I end up "leaking" a vswitch port. (By "leaking" I mean that it stays allocated even after the instance that it was allocated for is deleted.) I've been testing this by calling pdb.set_trace() to pause the resize while the nova delete runs, then letting the resize continue. Yes, this exaggerates the timing issues, but it shouldn't introduce any new races if the code is correct. I think the problem occurs because the deletion path can't confirm the migration/resize because it hasn't gotten to the proper state yet. The resize code takes various exceptions depending on the exact timing of when the deletion happens, but it doesn't trigger a revert of the resize and it doesn't clean up the vswitch port on the source host. See sample log below. I'm not sure what the proper fix is for this case. It seems that until we set "instance.vm_state = vm_states.RESIZED" it should be up to the resize code to clean up all resources if the instance gets deleted while a resize is in progress. Sample log on source compute node. This is with a pause right at the beginning of _finish_resize(). (Pdb) c 2015-11-03 23:28:37.968 17000 INFO nova.compute.resource_tracker [req-2d9812a5-eadf-4eb2-96a3-d46e496b292d - - - - -] Auditing locally available compute resources for node compute-1 2015-11-03 23:28:38.511 17000 INFO nova.network.neutronv2.api [req-46519d0e-6e80-40e1-bbb3-dc39184eb046 41f42dfc41f9428fb143623f0a83d2fa 726f4a1ce23f4f12acb9139dcfcdb313 - - -] [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] Port dad525cf-75af-47a7-a57c-cc6fa26a6cf2 from network info_cache is no longer associated with instance in Neutron. Removing from network info_cache. 
2015-11-03 23:28:38.677 17000 ERROR nova.compute.manager [req-46519d0e-6e80-40e1-bbb3-dc39184eb046 41f42dfc41f9428fb143623f0a83d2fa 726f4a1ce23f4f12acb9139dcfcdb313 - - -] [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] Setting instance vm_state to ERROR 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] Traceback (most recent call last): 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "/usr/lib64/python2.7/site-packages/nova/compute/manager.py", line 4350, in finish_resize 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] disk_info, image) 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "/usr/lib64/python2.7/site-packages/nova/compute/manager.py", line 4247, in _finish_resize 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] resize_instance = False 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7866, in macs_for_instance 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/pci/manager.py", line 380, in get_instance_pci_devs 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/objects/base.py", line 72, in getter 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/objects/instance.py", line 1022, in obj_load_attr 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/objects/instance.py", line 904, in _load_generic 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/objects/base.py", line 161, in wrapper 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "./usr/lib64/python2.7/site-packages/nova/conductor/rpcapi.py", line 335, in object_class_action 2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 21a4ff59-9057-4bc2-8b5e-185374c2d557] File "/usr/lib64/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call 2015-11-03 23:28:38.677
[Yahoo-eng-team] [Bug 1471997] Re: nova MAX_FUNC value in nova/pci/devspec.py is too low
Jay Pipes helpfully pointed out that the MAX_FUNC value was defined by the PCI spec, and didn't refer to the SRIOV VF value, but rather the PCI device function. The original issue turned out to be a local problem generating the PCI whitelist. ** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1471997 Title: nova MAX_FUNC value in nova/pci/devspec.py is too low Status in OpenStack Compute (nova): Invalid Bug description: The MAX_FUNC value in nova/pci/devspec.py is set to 0x7. This limits us to a relatively small number of VFs per PF, which is annoying when trying to use SRIOV in any sort of serious way. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1471997/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
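For background, the function field of a PCI bus/device/function address really is a 3-bit field, so 0x7 is its hard maximum; SR-IOV scales by consuming additional device (and bus) numbers rather than functions. A standalone illustration:

def parse_bdf(addr):
    # e.g. '0000:81:00.7' -> (domain, bus, slot, function)
    domain, bus, rest = addr.split(':')
    slot, func = rest.split('.')
    func = int(func, 16)
    assert func <= 0x7, 'function is a 3-bit field per the PCI spec'
    return int(domain, 16), int(bus, 16), int(slot, 16), func

print(parse_bdf('0000:81:00.7'))  # (0, 129, 0, 7)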
[Yahoo-eng-team] [Bug 1484742] Re: NUMATopologyFilter doesn't account for CPU/RAM overcommit
** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1484742 Title: NUMATopologyFilter doesn't account for CPU/RAM overcommit Status in OpenStack Compute (nova): Invalid Bug description: There seems to be a bug in the NUMATopologyFilter where it doesn't properly account for cpu_allocation_ratio or ram_allocation_ratio. (Detected on stable/kilo, not sure if it applies to current master.) To reproduce: 1) Create a flavor with a moderate number of CPUs (5, for example) and enable hugepages by setting hw:mem_page_size=2048 in the flavor extra specs. Do not specify dedicated CPUs on the flavor. 2) Ensure that the available compute nodes have fewer CPUs free than the number of CPUs in the flavor above. 3) Ensure that the cpu_allocation_ratio is big enough that num_free_cpus * cpu_allocation_ratio is more than the number of CPUs in the flavor above. 4) Enable the NUMATopologyFilter for the nova filter scheduler. 5) Try to boot an instance with the specified flavor. This should pass, because we're not using dedicated CPUs and so the cpu_allocation_ratio should apply. However, the NUMATopologyFilter returns 0 hosts. It seems like the NUMATopologyFilter is failing to properly account for the cpu_allocation_ratio when checking whether an instance can fit onto a given host. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1484742/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1485631] [NEW] CPU/RAM overcommit treated differently by normal and NUMA topology case
Public bug reported: Currently in the NUMA topology case (so multi-node guest, dedicated CPUs, hugepages in the guest, etc.) a single guest is not allowed to consume more CPU/RAM than the host actually has in total regardless of the specified overcommit ratio. In other words, the overcommit ratio only applies when the host resources are being used by multiple guests. A given host resource can only be used once by any particular guest. So as an example, if the host has 2 pCPUs in total for guests, a single guest instance is not allowed to use more than 2CPUs but you might be able to have 16 such instances running. (Assuming default CPU overcommit ratio.) However, this is not true when the NUMA topology is not involved. In that case a host with 2 pCPUs would allow a guest with 3 vCPUs to be spawned. We should pick one behaviour as correct and adjust the other one to match. Given that the NUMA topology case was discussed more recently, it seems reasonable to select it as the correct behaviour. ** Affects: nova Importance: Undecided Status: New ** Tags: compute scheduler -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1485631 Title: CPU/RAM overcommit treated differently by normal and NUMA topology case Status in OpenStack Compute (nova): New Bug description: Currently in the NUMA topology case (so multi-node guest, dedicated CPUs, hugepages in the guest, etc.) a single guest is not allowed to consume more CPU/RAM than the host actually has in total regardless of the specified overcommit ratio. In other words, the overcommit ratio only applies when the host resources are being used by multiple guests. A given host resource can only be used once by any particular guest. So as an example, if the host has 2 pCPUs in total for guests, a single guest instance is not allowed to use more than 2CPUs but you might be able to have 16 such instances running. (Assuming default CPU overcommit ratio.) However, this is not true when the NUMA topology is not involved. In that case a host with 2 pCPUs would allow a guest with 3 vCPUs to be spawned. We should pick one behaviour as correct and adjust the other one to match. Given that the NUMA topology case was discussed more recently, it seems reasonable to select it as the correct behaviour. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1485631/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
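A worked example of the divergence (standalone arithmetic, not nova code), using a 2-pCPU host and the default 16.0 CPU overcommit:

pcpus, ratio = 2, 16.0
host_capacity = pcpus * ratio            # 32 vCPUs across all guests

def fits_numa(guest_vcpus):
    # NUMA-topology path: one guest is capped at the host's physical size.
    return guest_vcpus <= pcpus

def fits_normal(guest_vcpus):
    # Non-NUMA path: only the overcommitted total matters.
    return guest_vcpus <= host_capacity

print(fits_numa(3), fits_normal(3))      # False True -> divergent behaviour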
[Yahoo-eng-team] [Bug 1484742] [NEW] NUMATopologyFilter doesn't account for cpu_allocation_ratio
Public bug reported: There seems to be a bug in the NUMATopologyFilter where it doesn't properly account for cpu_allocation_ratio. (Detected on stable/kilo, not sure if it applies to current master.) To reproduce: 1) Create a flavor with a moderate number of CPUs (5, for example) and enable hugepages by setting hw:mem_page_size=2048 in the flavor extra specs. Do not specify dedicated CPUs on the flavor. 2) Ensure that the available compute nodes have fewer CPUs free than the number of CPUs in the flavor above. 3) Ensure that the cpu_allocation_ratio is big enough that num_free_cpus * cpu_allocation_ratio is more than the number of CPUs in the flavor above. 4) Enable the NUMATopologyFilter for the nova filter scheduler. 5) Try to boot an instance with the specified flavor. This should pass, because we're not using dedicated CPUs and so the cpu_allocation_ratio should apply. However, the NUMATopologyFilter returns 0 hosts. It seems like the NUMATopologyFilter is failing to properly account for the cpu_allocation_ratio when checking whether an instance can fit onto a given host. ** Affects: nova Importance: Undecided Status: New ** Tags: compute scheduler ** Description changed: There seems to be a bug in the NUMATopologyFilter where it doesn't - properly account for cpu_allocation_ratio. + properly account for cpu_allocation_ratio. (Detected on stable/kilo, + not sure if it applies to current master.) To reproduce: 1) Create a flavor with a moderate number of CPUs (5, for example) and enable hugepages by setting hw:mem_page_size=2048 in the flavor extra specs. Do not specify dedicated CPUs on the flavor. 2) Ensure that the available compute nodes have fewer CPUs free than the number of CPUs in the flavor above. 3) Ensure that the cpu_allocation_ratio is big enough that num_free_cpus * cpu_allocation_ratio is more than the number of CPUs in the flavor above. 4) Enable the NUMATopologyFilter for the nova filter scheduler. 5) Try to boot an instance with the specified flavor. This should pass, because we're not using dedicated CPUs and so the cpu_allocation_ratio should apply. However, the NUMATopologyFilter returns 0 hosts. It seems like the NUMATopologyFilter is failing to properly account for the cpu_allocation_ratio when checking whether an instance can fit onto a given host. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1484742 Title: NUMATopologyFilter doesn't account for cpu_allocation_ratio Status in OpenStack Compute (nova): New Bug description: There seems to be a bug in the NUMATopologyFilter where it doesn't properly account for cpu_allocation_ratio. (Detected on stable/kilo, not sure if it applies to current master.) To reproduce: 1) Create a flavor with a moderate number of CPUs (5, for example) and enable hugepages by setting hw:mem_page_size=2048 in the flavor extra specs. Do not specify dedicated CPUs on the flavor. 2) Ensure that the available compute nodes have fewer CPUs free than the number of CPUs in the flavor above. 3) Ensure that the cpu_allocation_ratio is big enough that num_free_cpus * cpu_allocation_ratio is more than the number of CPUs in the flavor above. 4) Enable the NUMATopologyFilter for the nova filter scheduler. 5) Try to boot an instance with the specified flavor. This should pass, because we're not using dedicated CPUs and so the cpu_allocation_ratio should apply. However, the NUMATopologyFilter returns 0 hosts. 
It seems like the NUMATopologyFilter is failing to properly account for the cpu_allocation_ratio when checking whether an instance can fit onto a given host. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1484742/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
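To make the expected behaviour concrete, the check the filter ought to perform for floating (non-dedicated) vCPUs amounts to something like the following sketch; the function and variable names here are hypothetical, not the actual nova code:

    # Hypothetical sketch of honouring cpu_allocation_ratio for shared vCPUs.
    def cell_has_cpu_room(host_cell_pcpus, used_vcpus, requested_vcpus,
                          cpu_allocation_ratio):
        # Capacity for floating instances is physical CPUs times the
        # overcommit ratio, not the raw count of free physical CPUs.
        limit = len(host_cell_pcpus) * cpu_allocation_ratio
        return used_vcpus + requested_vcpus <= limit

    # Example: 4 free pCPUs with ratio 16.0 -> a 5-vCPU flavor should fit.
    print(cell_has_cpu_room(range(4), 0, 5, 16.0))  # True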
[Yahoo-eng-team] [Bug 1482416] [NEW] bug blocks DB migration that changes column type
Public bug reported: I'm trying to make the following change as a DB migration:

    +# Table compute_nodes, modify field 'vcpus_used' to Float
    +compute_nodes = Table('compute_nodes', meta, autoload=True)
    +vcpus_used = getattr(compute_nodes.c, 'vcpus_used')
    +vcpus_used.alter(type=Float)

This works at runtime (using PostgreSQL) but when running the unit tests (using sqlite) I get the following:

    nova.tests.unit.db.test_migrations.TestNovaMigrationsSQLite.test_models_sync
    Captured traceback:
    ~~~
    Traceback (most recent call last):
      File "/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_migrations.py", line 588, in test_models_sync
        "Models and migration scripts aren't in sync:\n%s" % msg)
      File "/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/unittest2/case.py", line 690, in fail
        raise self.failureException(msg)
    AssertionError: Models and migration scripts aren't in sync:
    [ ( 'add_constraint', UniqueConstraint(Column('host', String(length=255), table=<compute_nodes>), Column('hypervisor_hostname', String(length=255), table=<compute_nodes>), Column('deleted', Integer(), table=<compute_nodes>, default=ColumnDefault(0]

This appears to be an interaction between two things: the change I made to alter the vcpus_used column, and a previous change (commit 2db4a1ac, migration version 279) to add the uniq_compute_nodes0host0hypervisor_hostname constraint. sqlite doesn't support altering columns, so there's a workaround that makes a new column, copies the contents over, and deletes the old one... this seems to be running into problems with the modified constraint. I suspect that this means that anyone wanting to change the type of a column in the compute_nodes table will run into a similar problem. ** Affects: nova Importance: Undecided Status: New ** Tags: compute db -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1482416 Title: bug blocks DB migration that changes column type Status in OpenStack Compute (nova): New Bug description: I'm trying to make the following change as a DB migration:

    +# Table compute_nodes, modify field 'vcpus_used' to Float
    +compute_nodes = Table('compute_nodes', meta, autoload=True)
    +vcpus_used = getattr(compute_nodes.c, 'vcpus_used')
    +vcpus_used.alter(type=Float)

This works at runtime (using PostgreSQL) but when running the unit tests (using sqlite) I get the following:

    nova.tests.unit.db.test_migrations.TestNovaMigrationsSQLite.test_models_sync
    Captured traceback:
    ~~~
    Traceback (most recent call last):
      File "/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_migrations.py", line 588, in test_models_sync
        "Models and migration scripts aren't in sync:\n%s" % msg)
      File "/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/unittest2/case.py", line 690, in fail
        raise self.failureException(msg)
    AssertionError: Models and migration scripts aren't in sync:
    [ ( 'add_constraint', UniqueConstraint(Column('host', String(length=255), table=<compute_nodes>), Column('hypervisor_hostname', String(length=255), table=<compute_nodes>), Column('deleted', Integer(), table=<compute_nodes>, default=ColumnDefault(0]

This appears to be an interaction between two things: the change I made to alter the vcpus_used column, and a previous change (commit 2db4a1ac, migration version 279) to add the uniq_compute_nodes0host0hypervisor_hostname constraint. sqlite doesn't support altering columns, so there's a workaround that makes a new column, copies the contents over, and deletes the old one... this seems to be running into problems with the modified constraint. I suspect that this means that anyone wanting to change the type of a column in the compute_nodes table will run into a similar problem. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1482416/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
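For reference, a self-contained version of the migration being attempted might look like this under the sqlalchemy-migrate conventions nova used at the time (a sketch, not the actual migration script):

    from sqlalchemy import Float, MetaData, Table

    def upgrade(migrate_engine):
        meta = MetaData(bind=migrate_engine)
        compute_nodes = Table('compute_nodes', meta, autoload=True)
        # Alter 'vcpus_used' to Float.  On sqlite, sqlalchemy-migrate
        # emulates ALTER by creating a new table and copying rows, which
        # is where the unique-constraint mismatch described above appears.
        compute_nodes.c.vcpus_used.alter(type=Float)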
[Yahoo-eng-team] [Bug 1471997] [NEW] nova MAX_FUNC value in nova/pci/devspec.py is too low
Public bug reported: The MAX_FUNC value in nova/pci/devspec.py is set to 0x7. This limits us to a relatively small number of VFs per PF, which is annoying when trying to use SRIOV in any sort of serious way. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Tags: compute ** Tags added: compute ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1471997 Title: nova MAX_FUNC value in nova/pci/devspec.py is too low Status in OpenStack Compute (Nova): New Bug description: The MAX_FUNC value in nova/pci/devspec.py is set to 0x7. This limits us to a relatively small number of VFs per PF, which is annoying when trying to use SRIOV in any sort of serious way. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1471997/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
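For context, the function field of a standard PCI bus:device.function address is only 3 bits wide, so 0x7 is the architectural maximum there; SR-IOV devices using ARI can expose more functions per device, which is why the cap bites. A toy illustration of the bounds check (hypothetical, not the devspec.py code):

    MAX_FUNC = 0x7

    def check_func(func_str):
        func = int(func_str, 16)
        if func > MAX_FUNC:
            # With ARI an SR-IOV device can have more than 8 functions,
            # so this rejection is what makes MAX_FUNC = 0x7 too low.
            raise ValueError("PCI function %#x out of range" % func)
        return func

    check_func('7')    # accepted
    # check_func('f')  # rejected, even though ARI allows function 0xf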
[Yahoo-eng-team] [Bug 1461678] [NEW] nova error handling causes glance to keep unlinked files open, wasting space
Public bug reported: When creating larger glance images (like a 10GB CentOS7 image), if we run into a situation where we run out of room on the destination device, we cannot recover the space from glance. glance-api will have open unlinked files, so a TONNE of space is unavailable until we restart glance-api. Nova will try to reschedule the instance 3 times, so you should see this in nova-conductor.log: u'RescheduledException: Build of instance 98ca2c0d-44b2-48a6-b1af-55f4b2db73c1 was re-scheduled: [Errno 28] No space left on device\n'] The problem is this code in nova.image.glance.GlanceImageService.download():

    if data is None:
        return image_chunks
    else:
        try:
            for chunk in image_chunks:
                data.write(chunk)
        finally:
            if close_file:
                data.close()

image_chunks is an iterator. If we take an exception (like we can't write the file because the filesystem is full) then we will stop iterating over the chunks. If we don't iterate over all the chunks then glance will keep the file open. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Tags: compute ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1461678 Title: nova error handling causes glance to keep unlinked files open, wasting space Status in OpenStack Compute (Nova): New Bug description: When creating larger glance images (like a 10GB CentOS7 image), if we run into a situation where we run out of room on the destination device, we cannot recover the space from glance. glance-api will have open unlinked files, so a TONNE of space is unavailable until we restart glance-api. Nova will try to reschedule the instance 3 times, so you should see this in nova-conductor.log: u'RescheduledException: Build of instance 98ca2c0d-44b2-48a6-b1af-55f4b2db73c1 was re-scheduled: [Errno 28] No space left on device\n'] The problem is this code in nova.image.glance.GlanceImageService.download():

    if data is None:
        return image_chunks
    else:
        try:
            for chunk in image_chunks:
                data.write(chunk)
        finally:
            if close_file:
                data.close()

image_chunks is an iterator. If we take an exception (like we can't write the file because the filesystem is full) then we will stop iterating over the chunks. If we don't iterate over all the chunks then glance will keep the file open. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1461678/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
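One way to avoid leaving the server-side file open, sketched below with the same variable names as the snippet above (an illustration, not the fix that actually merged), is to make sure the chunk iterator is always drained even when the local write fails:

    def download_chunks(image_chunks, data=None, close_file=False):
        # Hypothetical reworking of the download() body quoted above.
        if data is None:
            return image_chunks
        try:
            for chunk in image_chunks:
                data.write(chunk)
        finally:
            try:
                # Drain whatever is left so glance-api releases the
                # underlying (possibly unlinked) file.
                for _chunk in image_chunks:
                    pass
            finally:
                if close_file:
                    data.close()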
[Yahoo-eng-team] [Bug 1459782] [NEW] _is_storage_shared_with() in libvirt/driver.py gives possibly false results if ssh keys not configured
Public bug reported: In virt.libvirt.driver.LibvirtDriver._is_storage_shared_with() we first check IP addresses, and if they don't match then we'll try to use ssh to check whether the storage is actually shared or not. If ssh keys are not set up between the compute nodes for the user running nova-compute then the call to utils.ssh_execute() will fail and we will return wrong information. utils.ssh_execute() is also used in _cleanup_remote_migration() and migrate_disk_and_power_off() and would suffer from similar issues there. Either we need to ensure that the requirement for pre-sharing the ssh keys is clearly documented, or we need to convert these to use RPC calls like live migration. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1459782 Title: _is_storage_shared_with() in libvirt/driver.py gives possibly false results if ssh keys not configured Status in OpenStack Compute (Nova): New Bug description: In virt.libvirt.driver.LibvirtDriver._is_storage_shared_with() we first check IP addresses, and if they don't match then we'll try to use ssh to check whether the storage is actually shared or not. If ssh keys are not set up between the compute nodes for the user running nova-compute then the call to utils.ssh_execute() will fail and we will return wrong information. utils.ssh_execute() is also used in _cleanup_remote_migration() and migrate_disk_and_power_off() and would suffer from similar issues there. Either we need to ensure that the requirement for pre-sharing the ssh keys is clearly documented, or we need to convert these to use RPC calls like live migration. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1459782/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
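For illustration, the check works roughly like this (a simplified, hypothetical rendering using plain subprocess rather than nova's utils.ssh_execute()); note how an ssh failure is indistinguishable from "file not present":

    import os
    import subprocess
    import tempfile

    def is_storage_shared_with(dest_host, inst_base):
        # Drop a marker file locally, then look for it on the remote
        # host; if it is visible there, inst_base is shared storage.
        fd, marker = tempfile.mkstemp(dir=inst_base)
        os.close(fd)
        try:
            subprocess.check_call(['ssh', dest_host, 'test', '-e', marker])
            return True
        except subprocess.CalledProcessError:
            # 'test -e' failing means "not shared" -- but an ssh
            # connection failure (no keys set up) raises the exact same
            # exception, so it is silently misread as "not shared".
            return False
        finally:
            os.unlink(marker)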
[Yahoo-eng-team] [Bug 1458122] [NEW] nova shouldn't error if we can't schedule all of max_count instances at boot time
Public bug reported: When booting up instances, nova allows the user to specify a min count and a max count. Currently, if the user has quota space for max count instances, then nova will try to create them all. If any of them can't be scheduled, then the creation of all of them will be aborted and they will all be put into an error state. Arguably, if nova was able to schedule at least min count instances (which defaults to 1) then it should continue on with creating those instances that it was able to schedule. Only if nova cannot create at least min count instances should nova actually consider the request as failed. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1458122 Title: nova shouldn't error if we can't schedule all of max_count instances at boot time Status in OpenStack Compute (Nova): New Bug description: When booting up instances, nova allows the user to specify a min count and a max count. Currently, if the user has quota space for max count instances, then nova will try to create them all. If any of them can't be scheduled, then the creation of all of them will be aborted and they will all be put into an error state. Arguably, if nova was able to schedule at least min count instances (which defaults to 1) then it should continue on with creating those instances that it was able to schedule. Only if nova cannot create at least min count instances should nova actually consider the request as failed. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1458122/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
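The proposed semantics, as a sketch (hypothetical names, not nova code):

    # Treat a partial schedule as success when at least min_count fit.
    def pick_hosts(min_count, max_count, schedulable_hosts):
        hosts = schedulable_hosts[:max_count]
        if len(hosts) < min_count:
            # Only now is the whole request a genuine failure.
            raise RuntimeError('fewer than min_count instances fit')
        # Proceed with what fit instead of erroring out all instances.
        return hosts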
[Yahoo-eng-team] [Bug 1454451] [NEW] simultaneous boot of multiple instances leads to cpu pinning overlap
Public bug reported: I'm running into an issue with kilo-3 that I think is present in current trunk. I think there is a race between the claimed CPUs of an instance being persisted to the DB, and the resource audit scanning the DB for instances and subtracting pinned CPUs from the list of available CPUs. The problem only shows up when the following sequence happens: 1) instance A (with dedicated cpus) boots on a compute node 2) resource audit runs on that compute node 3) instance B (with dedicated cpus) boots on the same compute node So you need to either be booting many instances, or limiting the valid compute nodes (host aggregate or server groups), or have a small cluster in order to hit this. The nitty-gritty view looks like this: When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in compute.resource_tracker.ResourceTracker.instance_claim() and that covers updating the resource usage on the compute node. But we don't persist the instance numa topology to the database until after instance_claim() returns, in compute.manager.ComputeManager._build_instance(). Note that this is done *after* we've given up the semaphore, so there is no longer any sort of ordering guarantee. compute.resource_tracker.ResourceTracker.update_available_resource() then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a list of instances and uses that to calculate a new view of what resources are available. If the numa topology of the most recent instance hasn't been persisted yet, then the new view of resources won't include any pCPUs pinned by that instance. compute.manager.ComputeManager._build_instance() runs for the next instance and based on the new view of available resources it allocates the same pCPU(s) used by the earlier instance. Boom, overlapping pinned pCPUs. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1454451 Title: simultaneous boot of multiple instances leads to cpu pinning overlap Status in OpenStack Compute (Nova): New Bug description: I'm running into an issue with kilo-3 that I think is present in current trunk. I think there is a race between the claimed CPUs of an instance being persisted to the DB, and the resource audit scanning the DB for instances and subtracting pinned CPUs from the list of available CPUs. The problem only shows up when the following sequence happens: 1) instance A (with dedicated cpus) boots on a compute node 2) resource audit runs on that compute node 3) instance B (with dedicated cpus) boots on the same compute node So you need to either be booting many instances, or limiting the valid compute nodes (host aggregate or server groups), or have a small cluster in order to hit this. The nitty-gritty view looks like this: When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in compute.resource_tracker.ResourceTracker.instance_claim() and that covers updating the resource usage on the compute node. But we don't persist the instance numa topology to the database until after instance_claim() returns, in compute.manager.ComputeManager._build_instance(). Note that this is done *after* we've given up the semaphore, so there is no longer any sort of ordering guarantee.
compute.resource_tracker.ResourceTracker.update_available_resource() then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a list of instances and uses that to calculate a new view of what resources are available. If the numa topology of the most recent instance hasn't been persisted yet, then the new view of resources won't include any pCPUs pinned by that instance. compute.manager.ComputeManager._build_instance() runs for the next instance and based on the new view of available resources it allocates the same pCPU(s) used by the earlier instance. Boom, overlapping pinned pCPUs. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1454451/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
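The ordering fix implied by the analysis above would look roughly like this (a simplified sketch; claimed_numa_topology and the lock helper are assumptions, not the exact nova code): persist the claimed topology before the semaphore is released, so the audit can never read a stale view.

    from oslo_concurrency import lockutils

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

    class ResourceTracker(object):
        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
        def instance_claim(self, context, instance, claim):
            # Write the pinned-CPU topology to the DB while still
            # holding the lock that update_available_resource() takes.
            instance.numa_topology = claim.claimed_numa_topology
            instance.save()
            self._update_usage_from_instance(context, instance)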
[Yahoo-eng-team] [Bug 1444171] [NEW] evacuate code path is not updating binding:host_id in neutron
Public bug reported: Currently when using neutron we don't update the binding:host_id during the evacuate code path. This can cause the evacuation to fail if we go to sleep waiting for events in virt.libvirt.driver.LibvirtDriver._create_domain_and_network(). Since the binding:host_id in neutron is still pointing to the old host, neutron will never send us any events and we will eventually time out. I was able to get the evacuation to update the binding by modifying compute.manager.ComputeManager.rebuild_instance() to add a call to self.network_api.setup_instance_network_on_host() right below the existing call to self.network_api.setup_networks_on_host(). I'm not sure this solution would work for nova network though. It's a bit confusing: currently the networking API has three routines that seem to overlap: setup_networks_on_host() -- this actually does setup or teardown, and is empty for neutron setup_instance_network_on_host() -- this maps to self._update_port_binding_for_instance() for neutron. For nova network it maps to self.migrate_instance_finish() but that doesn't actually seem to do anything. cleanup_instance_network_on_host() -- this is empty for neutron. It maps to self.migrate_instance_start() for nova network, but that doesn't actually seem to do anything. It seems like for neutron we use setup_instance_network_on_host() and for nova-network we use setup_networks_on_host() and the rest are not actually used. ** Affects: nova Importance: Undecided Status: New ** Tags: compute network -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1444171 Title: evacuate code path is not updating binding:host_id in neutron Status in OpenStack Compute (Nova): New Bug description: Currently when using neutron we don't update the binding:host_id during the evacuate code path. This can cause the evacuation to fail if we go to sleep waiting for events in virt.libvirt.driver.LibvirtDriver._create_domain_and_network(). Since the binding:host_id in neutron is still pointing to the old host, neutron will never send us any events and we will eventually time out. I was able to get the evacuation to update the binding by modifying compute.manager.ComputeManager.rebuild_instance() to add a call to self.network_api.setup_instance_network_on_host() right below the existing call to self.network_api.setup_networks_on_host(). I'm not sure this solution would work for nova network though. It's a bit confusing: currently the networking API has three routines that seem to overlap: setup_networks_on_host() -- this actually does setup or teardown, and is empty for neutron setup_instance_network_on_host() -- this maps to self._update_port_binding_for_instance() for neutron. For nova network it maps to self.migrate_instance_finish() but that doesn't actually seem to do anything. cleanup_instance_network_on_host() -- this is empty for neutron. It maps to self.migrate_instance_start() for nova network, but that doesn't actually seem to do anything. It seems like for neutron we use setup_instance_network_on_host() and for nova-network we use setup_networks_on_host() and the rest are not actually used. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1444171/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
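The workaround described above would sit in rebuild_instance() roughly as follows (signatures simplified; a sketch of the change, not the exact patch):

    def rebuild_instance(self, context, instance, *args, **kwargs):
        ...
        self.network_api.setup_networks_on_host(context, instance,
                                                self.host)
        # Additionally re-bind the ports so neutron's binding:host_id
        # points at the evacuation target and network-vif-plugged
        # events get routed to the right compute host.
        self.network_api.setup_instance_network_on_host(
            context, instance, self.host)
        ...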
[Yahoo-eng-team] [Bug 1298513] Re: nova server group policy should be applied when resizing/migrating server
*** This bug is a duplicate of bug 1379451 *** https://bugs.launchpad.net/bugs/1379451 ** This bug has been marked a duplicate of bug 1379451 anti-affinity policy only honored on boot -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1298513 Title: nova server group policy should be applied when resizing/migrating server Status in OpenStack Compute (Nova): Confirmed Bug description: If I do the following: nova server-group-create --policy affinity affinitygroup nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0 nova resize cirros0 2 The cirros0 server will be resized but when the scheduler runs it doesn't take into account the scheduling policy of the server group. I think the same would be true if we migrate the server. Lastly, when doing migration/evacuation and the user has specified the compute node we might want to validate the choice against the group policy. For emergencies we might want to allow policy violation with a --force option or something. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1298513/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1423648] [NEW] race conditions with server group scheduler policies
Public bug reported: In git commit a79ecbe Russell Bryant submitted a partial fix for a race condition when booting an instance as part of a server group with an anti-affinity scheduler policy. That fix only solves part of the problem, however. There are a number of issues remaining: 1) It's possible to hit a similar race condition for server groups with the affinity policy. Suppose we create a new group and then create two instances simultaneously. The scheduler sees an empty group for each, assigns them to different compute nodes, and the policy is violated. We should add a check in _validate_instance_group_policy() to cover the affinity case. 2) It's possible to create two instances simultaneously, have them be scheduled to conflicting hosts, both of them detect the problem in _validate_instance_group_policy(), both of them get sent back for rescheduling, and both of them get assigned to conflicting hosts *again*, resulting in an error. In order to fix this I propose that instead of checking against all other instances in the group, we only check against instances that were created before the current instance. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1423648 Title: race conditions with server group scheduler policies Status in OpenStack Compute (Nova): New Bug description: In git commit a79ecbe Russell Bryant submitted a partial fix for a race condition when booting an instance as part of a server group with an anti-affinity scheduler policy. That fix only solves part of the problem, however. There are a number of issues remaining: 1) It's possible to hit a similar race condition for server groups with the affinity policy. Suppose we create a new group and then create two instances simultaneously. The scheduler sees an empty group for each, assigns them to different compute nodes, and the policy is violated. We should add a check in _validate_instance_group_policy() to cover the affinity case. 2) It's possible to create two instances simultaneously, have them be scheduled to conflicting hosts, both of them detect the problem in _validate_instance_group_policy(), both of them get sent back for rescheduling, and both of them get assigned to conflicting hosts *again*, resulting in an error. In order to fix this I propose that instead of checking against all other instances in the group, we only check against instances that were created before the current instance. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1423648/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
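Point 2's proposal, as a sketch (hypothetical helper, not the nova code): when validating on the compute host, only compare against members created before the instance being checked, so two racing instances cannot keep invalidating each other.

    def members_to_check(group_instances, instance):
        # Instances created later than 'instance' are ignored here;
        # they will perform the symmetric check against us instead.
        return [other for other in group_instances
                if other.uuid != instance.uuid
                and other.created_at < instance.created_at]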
[Yahoo-eng-team] [Bug 1420848] [NEW] nova-compute service spuriously marked as up when disabled
Public bug reported: I think our usage of the updated_at field to determine whether a service is up or not is buggy. Consider this scenario: 1) nova-compute is happily running and is up/enabled on compute-0 2) something causes nova-compute to stop (process crash, hardware fault, power failure, network isolation, etc.) 3) a minute later, the nova-compute service is reported as down 4) I run nova service-disable compute-0 nova-compute 5) nova-compute is now reported as up for the next minute I wonder if it would make sense to have a separate last status timestamp database field that would only get updated when we get a service status update and not when we change any other fields. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1420848 Title: nova-compute service spuriously marked as up when disabled Status in OpenStack Compute (Nova): New Bug description: I think our usage of the updated_at field to determine whether a service is up or not is buggy. Consider this scenario: 1) nova-compute is happily running and is up/enabled on compute-0 2) something causes nova-compute to stop (process crash, hardware fault, power failure, network isolation, etc.) 3) a minute later, the nova-compute service is reported as down 4) I run nova service-disable compute-0 nova-compute 5) nova-compute is now reported as up for the next minute I wonder if it would make sense to have a separate last status timestamp database field that would only get updated when we get a service status update and not when we change any other fields. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1420848/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
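The suggestion amounts to basing liveness on a timestamp that only heartbeats may touch; a sketch with an assumed field name (last_heartbeat_at is hypothetical):

    import datetime

    def is_up(service, service_down_time=60):
        # Updated only by the periodic _report_state(), never by
        # administrative operations such as service-disable.
        last = service.last_heartbeat_at
        elapsed = (datetime.datetime.utcnow() - last).total_seconds()
        return elapsed <= service_down_time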
[Yahoo-eng-team] [Bug 1419115] [NEW] IndexError adding host to availability zone
Public bug reported: There appears to be a bug in the code dealing with adding a disabled host to an aggregate that is exported as an availability zone. I disabled the nova-compute service on a host and then tried to add it to an aggregate that is exported as an availability zone. This resulted in the following error:

    File "/usr/lib64/python2.7/site-packages/oslo/utils/excutils.py", line 82, in __exit__
      six.reraise(self.type_, self.value, self.tb)
    File "/usr/lib64/python2.7/site-packages/nova/exception.py", line 71, in wrapped
      return f(self, context, *args, **kw)
    File "/usr/lib64/python2.7/site-packages/nova/compute/api.py", line 3673, in add_host_to_aggregate
      aggregate=aggregate)
    File "/usr/lib64/python2.7/site-packages/nova/compute/api.py", line 3591, in is_safe_to_update_az
      host_az = host_azs.pop()
    IndexError: pop from empty list

The code looks like this:

    if 'availability_zone' in metadata:
        _hosts = hosts or aggregate.hosts
        zones, not_zones = availability_zones.get_availability_zones(
            context, with_hosts=True)
        for host in _hosts:
            # NOTE(sbauza): Host can only be in one AZ, so let's take only
            # the first element
            host_azs = [az for (az, az_hosts) in zones
                        if host in az_hosts
                        and az != CONF.internal_service_availability_zone]
            host_az = host_azs.pop()

It appears that for a disabled host, host_azs can be empty, resulting in an error when we try to pop() from it. It works fine if the service is enabled on the host, and it works fine if the service is disabled and I try to add the host to an aggregate that is not exported as an availability zone. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1419115 Title: IndexError adding host to availability zone Status in OpenStack Compute (Nova): New Bug description: There appears to be a bug in the code dealing with adding a disabled host to an aggregate that is exported as an availability zone. I disabled the nova-compute service on a host and then tried to add it to an aggregate that is exported as an availability zone. This resulted in the following error:

    File "/usr/lib64/python2.7/site-packages/oslo/utils/excutils.py", line 82, in __exit__
      six.reraise(self.type_, self.value, self.tb)
    File "/usr/lib64/python2.7/site-packages/nova/exception.py", line 71, in wrapped
      return f(self, context, *args, **kw)
    File "/usr/lib64/python2.7/site-packages/nova/compute/api.py", line 3673, in add_host_to_aggregate
      aggregate=aggregate)
    File "/usr/lib64/python2.7/site-packages/nova/compute/api.py", line 3591, in is_safe_to_update_az
      host_az = host_azs.pop()
    IndexError: pop from empty list

The code looks like this:

    if 'availability_zone' in metadata:
        _hosts = hosts or aggregate.hosts
        zones, not_zones = availability_zones.get_availability_zones(
            context, with_hosts=True)
        for host in _hosts:
            # NOTE(sbauza): Host can only be in one AZ, so let's take only
            # the first element
            host_azs = [az for (az, az_hosts) in zones
                        if host in az_hosts
                        and az != CONF.internal_service_availability_zone]
            host_az = host_azs.pop()

It appears that for a disabled host, host_azs can be empty, resulting in an error when we try to pop() from it. It works fine if the service is enabled on the host, and it works fine if the service is disabled and I try to add the host to an aggregate that is not exported as an availability zone.
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1419115/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
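A minimal guard for the loop quoted above would avoid the pop() on an empty list (a sketch, not the merged fix):

    host_azs = [az for (az, az_hosts) in zones
                if host in az_hosts
                and az != CONF.internal_service_availability_zone]
    # A disabled host may not appear in any AZ's host list yet.
    host_az = host_azs[0] if host_azs else None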
[Yahoo-eng-team] [Bug 1417667] [NEW] migration/evacuation/rebuild/resize of instance with dedicated cpus needs to recalculate cpus on destination
Public bug reported: I'm running nova trunk, commit 752954a. I configured a flavor with two vcpus and extra specs hw:cpu_policy=dedicated in order to enable vcpu pinning. I booted up a number of instances such that there was one instance affined to host cpus 12 and 13 on compute-0, and another instance affined to cpus 12 and 13 on compute-2. (As reported by virsh vcpupin and virsh dumpxml.) I then triggered a live migration of one instance from compute-0 to compute-2. This resulted in both instances being affined to host cpus 12 and 13 on compute-2. The hw:cpu_policy=dedicated extra spec is intended to provide dedicated host cpus for the instance. In order to provide this, on a live migration (or cold migration, or rebuild, or evacuation, or resize, etc.) nova needs to ensure that the instance is affined to host cpus that are not currently being used by other instances. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1417667 Title: migration/evacuation/rebuild/resize of instance with dedicated cpus needs to recalculate cpus on destination Status in OpenStack Compute (Nova): New Bug description: I'm running nova trunk, commit 752954a. I configured a flavor with two vcpus and extra specs hw:cpu_policy=dedicated in order to enable vcpu pinning. I booted up a number of instances such that there was one instance affined to host cpus 12 and 13 on compute-0, and another instance affined to cpus 12 and 13 on compute-2. (As reported by virsh vcpupin and virsh dumpxml.) I then triggered a live migration of one instance from compute-0 to compute-2. This resulted in both instances being affined to host cpus 12 and 13 on compute-2. The hw:cpu_policy=dedicated extra spec is intended to provide dedicated host cpus for the instance. In order to provide this, on a live migration (or cold migration, or rebuild, or evacuation, or resize, etc.) nova needs to ensure that the instance is affined to host cpus that are not currently being used by other instances. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1417667/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1417671] [NEW] when using dedicated cpus, the emulator thread should be affined as well
Public bug reported: I'm running nova trunk, commit 752954a. I configured a flavor with two vcpus and extra specs hw:cpu_policy=dedicated in order to enable vcpu pinning. I booted up an instance with this flavor, and virsh dumpxml shows that the two vcpus were affined suitably to host cpus, but the emulator thread was left to float across the available host cores on that numa node.

    <cputune>
      <shares>2048</shares>
      <vcpupin vcpu='0' policy='other' priority='0' cpuset='4'/>
      <vcpupin vcpu='1' policy='other' priority='0' cpuset='5'/>
      <emulatorpin cpuset='3-11'/>
    </cputune>

Looking at the kvm process shortly after creation, we see quite a few emulator threads running with the emulatorpin affinity:

    compute-2:~$ taskset -apc 136143
    pid 136143's current affinity list: 3-11
    pid 136144's current affinity list: 0,3-24,27-47
    pid 136146's current affinity list: 4
    pid 136147's current affinity list: 5
    pid 136149's current affinity list: 0
    pid 136433's current affinity list: 3-11
    pid 136434's current affinity list: 3-11
    pid 136435's current affinity list: 3-11
    pid 136436's current affinity list: 3-11
    pid 136437's current affinity list: 3-11
    pid 136438's current affinity list: 3-11
    pid 136439's current affinity list: 3-11
    pid 136440's current affinity list: 3-11
    pid 136441's current affinity list: 3-11
    pid 136442's current affinity list: 3-11
    pid 136443's current affinity list: 3-11
    pid 136444's current affinity list: 3-11
    pid 136445's current affinity list: 3-11
    pid 136446's current affinity list: 3-11
    pid 136447's current affinity list: 3-11
    pid 136448's current affinity list: 3-11
    pid 136449's current affinity list: 3-11
    pid 136450's current affinity list: 3-11
    pid 136451's current affinity list: 3-11
    pid 136452's current affinity list: 3-11
    pid 136453's current affinity list: 3-11
    pid 136454's current affinity list: 3-11

Since the purpose of hw:cpu_policy=dedicated is to provide a dedicated host CPU for each guest CPU, the libvirt emulatorpin cpuset for a given guest should be set to one (or possibly more) of the CPUs specified for that guest. Otherwise, any work done by the emulator threads could rob CPU time from another guest instance. Personally I'd like to see the emulator thread affined the same as guest vCPU 0 (we use guest vCPU0 as a maintenance processor while doing the real work on the other vCPUs), but an argument could be made that it should be affined to the logical OR of all the guest vCPU cpusets. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1417671 Title: when using dedicated cpus, the emulator thread should be affined as well Status in OpenStack Compute (Nova): New Bug description: I'm running nova trunk, commit 752954a. I configured a flavor with two vcpus and extra specs hw:cpu_policy=dedicated in order to enable vcpu pinning. I booted up an instance with this flavor, and virsh dumpxml shows that the two vcpus were affined suitably to host cpus, but the emulator thread was left to float across the available host cores on that numa node.

    <cputune>
      <shares>2048</shares>
      <vcpupin vcpu='0' policy='other' priority='0' cpuset='4'/>
      <vcpupin vcpu='1' policy='other' priority='0' cpuset='5'/>
      <emulatorpin cpuset='3-11'/>
    </cputune>

Looking at the kvm process shortly after creation, we see quite a few emulator threads running with the emulatorpin affinity:

    compute-2:~$ taskset -apc 136143
    pid 136143's current affinity list: 3-11
    pid 136144's current affinity list: 0,3-24,27-47
    pid 136146's current affinity list: 4
    pid 136147's current affinity list: 5
    pid 136149's current affinity list: 0
    pid 136433's current affinity list: 3-11
    pid 136434's current affinity list: 3-11
    pid 136435's current affinity list: 3-11
    pid 136436's current affinity list: 3-11
    pid 136437's current affinity list: 3-11
    pid 136438's current affinity list: 3-11
    pid 136439's current affinity list: 3-11
    pid 136440's current affinity list: 3-11
    pid 136441's current affinity list: 3-11
    pid 136442's current affinity list: 3-11
    pid 136443's current affinity list: 3-11
    pid 136444's current affinity list: 3-11
    pid 136445's current affinity list: 3-11
    pid 136446's current affinity list: 3-11
    pid 136447's current affinity list: 3-11
    pid 136448's current affinity list: 3-11
    pid 136449's current affinity list: 3-11
    pid 136450's current affinity list: 3-11
    pid 136451's current affinity list: 3-11
    pid 136452's current affinity list: 3-11
    pid 136453's current affinity list: 3-11
    pid 136454's current affinity list: 3-11

Since the purpose of hw:cpu_policy=dedicated is to provide a dedicated host CPU for each guest CPU, the libvirt emulatorpin cpuset for a given guest should be set to one (or possibly more) of the CPUs specified for that guest. Otherwise, any work done by the emulator threads could rob CPU time from another guest instance. Personally I'd like to see the emulator thread affined the same as guest vCPU 0 (we use guest vCPU0 as a maintenance processor while doing the real work on the other vCPUs), but an argument could be made that it should be affined to the logical OR of all the guest vCPU cpusets. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1417671/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
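What the report asks for, expressed as the libvirt XML the driver would generate if the emulator thread were pinned alongside guest vCPU 0 (illustrative values taken from the dump above):

    <cputune>
      <vcpupin vcpu='0' cpuset='4'/>
      <vcpupin vcpu='1' cpuset='5'/>
      <emulatorpin cpuset='4'/>
    </cputune>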
[Yahoo-eng-team] [Bug 1417723] [NEW] when using dedicated cpus, the guest topology doesn't match the host
Public bug reported: According to http://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html, the topology of the guest is set up as follows: In the absence of an explicit vCPU topology request, the virt drivers typically expose all vCPUs as sockets with 1 core and 1 thread. When strict CPU pinning is in effect the guest CPU topology will be setup to match the topology of the CPUs to which it is pinned. What I'm seeing is that when strict CPU pinning is in use the guest seems to be configuring multiple threads, even if the host doesn't have threading enabled. As an example, I set up a flavor with 2 vCPUs and enabled dedicated cpus. I then booted up an instance of this flavor on two separate compute nodes, one with hyperthreading enabled and one with hyperthreading disabled. In both cases, virsh dumpxml gave the following topology: <topology sockets='1' cores='1' threads='2'/> When running on the system with hyperthreading disabled, this should presumably have been set to cores=2 threads=1. Taking this a bit further, even if hyperthreading is enabled on the host it would be more accurate to only specify multiple threads in the guest topology if the vCPUs are actually affined to multiple threads of the same host core. Otherwise it would be more accurate to specify the guest topology with multiple cores of one thread each. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1417723 Title: when using dedicated cpus, the guest topology doesn't match the host Status in OpenStack Compute (Nova): New Bug description: According to http://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html, the topology of the guest is set up as follows: In the absence of an explicit vCPU topology request, the virt drivers typically expose all vCPUs as sockets with 1 core and 1 thread. When strict CPU pinning is in effect the guest CPU topology will be setup to match the topology of the CPUs to which it is pinned. What I'm seeing is that when strict CPU pinning is in use the guest seems to be configuring multiple threads, even if the host doesn't have threading enabled. As an example, I set up a flavor with 2 vCPUs and enabled dedicated cpus. I then booted up an instance of this flavor on two separate compute nodes, one with hyperthreading enabled and one with hyperthreading disabled. In both cases, virsh dumpxml gave the following topology: <topology sockets='1' cores='1' threads='2'/> When running on the system with hyperthreading disabled, this should presumably have been set to cores=2 threads=1. Taking this a bit further, even if hyperthreading is enabled on the host it would be more accurate to only specify multiple threads in the guest topology if the vCPUs are actually affined to multiple threads of the same host core. Otherwise it would be more accurate to specify the guest topology with multiple cores of one thread each. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1417723/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
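On the host with hyperthreading disabled, the expected element would instead be something like:

    <topology sockets='1' cores='2' threads='1'/>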
[Yahoo-eng-team] [Bug 1417201] [NEW] nova-scheduler exception when trying to use hugepages
Public bug reported: I'm trying to make use of huge pages as described in http://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/virt-driver-large-pages.html. I'm running nova kilo as of Jan 27th. The other openstack services are juno. Libvirt is 1.2.8. I've allocated 2MB pages on a compute node. virsh capabilities on that node contains:

    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>67028244</memory>
          <pages unit='KiB' size='4'>16032069</pages>
          <pages unit='KiB' size='2048'>5000</pages>
          <pages unit='KiB' size='1048576'>1</pages>
          ...
        <cell id='1'>
          <memory unit='KiB'>67108864</memory>
          <pages unit='KiB' size='4'>16052224</pages>
          <pages unit='KiB' size='2048'>5000</pages>
          <pages unit='KiB' size='1048576'>1</pages>

I then restarted nova-compute, I set hw:mem_page_size=large on a flavor, and then tried to boot up an instance with that flavor. I got the error logs below in nova-scheduler. Is this a bug?

    Feb 2 16:23:10 controller-0 nova-scheduler Exception during message handling: Cannot load 'mempages' in the base class
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     filter_properties)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 67, in select_destinations
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     filter_properties)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 138, in _schedule
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     filter_properties, index=num)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 391, in get_filtered_hosts
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     hosts, filter_properties, index)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/filters.py", line 77, in get_filtered_objects
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     list_objs = list(objs)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/filters.py", line 43, in filter_all
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     if self._filter_one(obj, filter_properties):
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filters/__init__.py", line 27, in _filter_one
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     return self.host_passes(obj, filter_properties)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/scheduler/filters/numa_topology_filter.py", line 45, in host_passes
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     limits_topology=limits))
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1161, in numa_fit_instance_to_host
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher     host_cell, instance_cell, limit_cell)
    2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 851, in _numa_fit_instance_cell
    2015-02-02 16:23:10.746
[Yahoo-eng-team] [Bug 1410924] [NEW] instructions for rebuilding API samples are wrong
Public bug reported: The instructions in nova/tests/functional/api_samples/README.rst say to run GENERATE_SAMPLES=True tox -epy27 nova.tests.unit.integrated, but that path doesn't exist anymore. Running GENERATE_SAMPLES=True tox -e functional seems to work, but someone who knows more than me should double-check that. It looks like this was missed when a bunch of functional tests were moved into nova/tests/functional. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1410924 Title: instructions for rebuilding API samples are wrong Status in OpenStack Compute (Nova): New Bug description: The instructions in nova/tests/functional/api_samples/README.rst say to run GENERATE_SAMPLES=True tox -epy27 nova.tests.unit.integrated, but that path doesn't exist anymore. Running GENERATE_SAMPLES=True tox -e functional seems to work, but someone who knows more than me should double-check that. It looks like this was missed when a bunch of functional tests were moved into nova/tests/functional. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1410924/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1368917] [NEW] rpc core should abort a call() early if the connection is terminated before the timeout period expires
Public bug reported: As it stands, if a client issuing an RPC call() sends a message to the rabbitmq server and the rabbitmq server then does a failover, the client will wait for the full RPC timeout period (60 seconds) even though the new rabbitmq server has come up long before then and some connections have been reestablished. The RPC core should notice that the server has gone away and should notify any entities waiting for an RPC call() response so that they can error out early rather than waiting for the full RPC timeout period. This was detected on Havana, but it seems to apply to all other versions as well. ** Affects: nova Importance: Undecided Status: New ** Tags: oslo -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1368917 Title: rpc core should abort a call() early if the connection is terminated before the timeout period expires Status in OpenStack Compute (Nova): New Bug description: As it stands, if a client issuing an RPC call() sends a message to the rabbitmq server and the rabbitmq server then does a failover, the client will wait for the full RPC timeout period (60 seconds) even though the new rabbitmq server has come up long before then and some connections have been reestablished. The RPC core should notice that the server has gone away and should notify any entities waiting for an RPC call() response so that they can error out early rather than waiting for the full RPC timeout period. This was detected on Havana, but it seems to apply to all other versions as well. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1368917/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1368989] [NEW] service_update() should not set an RPC timeout longer than service.report_interval
Public bug reported: nova.servicegroup.drivers.db.DbDriver._report_state() is called every service.report_interval seconds from a timer in order to periodically report the service state. It calls self.conductor_api.service_update(). If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC call() to nova-conductor. If anything happens to the RPC server (failover, switchover, etc.) by default the RPC code will wait 60 seconds for a response (blocking the timer-based calling of _report_state() in the meantime). This is long enough to cause the status in the database to get old enough that other services consider this service to be down. Arguably, since we're going to call service_update() again in service.report_interval seconds there's no reason to wait the full 60 seconds. Instead, it would make sense to set the RPC timeout for the service_update() call to something slightly less than service.report_interval seconds. I've also submitted a related bug report (https://bugs.launchpad.net/bugs/1368917) to improve RPC loss of connection in general, but I expect that'll take a while to deal with while this particular case can be handled much more easily. ** Affects: nova Importance: Undecided Status: New ** Tags: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1368989 Title: service_update() should not set an RPC timeout longer than service.report_interval Status in OpenStack Compute (Nova): New Bug description: nova.servicegroup.drivers.db.DbDriver._report_state() is called every service.report_interval seconds from a timer in order to periodically report the service state. It calls self.conductor_api.service_update(). If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC call() to nova-conductor. If anything happens to the RPC server (failover, switchover, etc.) by default the RPC code will wait 60 seconds for a response (blocking the timer-based calling of _report_state() in the meantime). This is long enough to cause the status in the database to get old enough that other services consider this service to be down. Arguably, since we're going to call service_update() again in service.report_interval seconds there's no reason to wait the full 60 seconds. Instead, it would make sense to set the RPC timeout for the service_update() call to something slightly less than service.report_interval seconds. I've also submitted a related bug report (https://bugs.launchpad.net/bugs/1368917) to improve RPC loss of connection in general, but I expect that'll take a while to deal with while this particular case can be handled much more easily. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1368989/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
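With oslo.messaging the suggested cap is a small change at the call site; a sketch of where it might go (placement and option name are assumptions, not the actual patch):

    def service_update(self, context, service, values):
        # Never wait longer than the next heartbeat is due.
        timeout = max(1, CONF.report_interval - 1)
        cctxt = self.client.prepare(timeout=timeout)
        return cctxt.call(context, 'service_update',
                          service=service, values=values)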
[Yahoo-eng-team] [Bug 1330744] [NEW] live migration is incorrectly comparing host cpu features
Public bug reported: Running Havana, we're seeing live migration fail when attempting to migrate from an Ivy-Bridge host to a Sandy-Bridge host. However, we're using the default kvm guest config which has a safe default virtual cpu with a subset of cpu features. /proc/cpuinfo from within the guest looks the same on both types of hosts. I think the problem is that when check_can_live_migrate_destination() calls _compare_cpu(), it's comparing the host CPUs. Instead, I think we should be comparing the guest CPU against the host CPU of the destination to make sure it's compatible. (Assuming that libvirt considers the qemu virtual cpu to be compatible with the host cpu.) ** Affects: nova Importance: Undecided Status: New ** Tags: compute libvirt -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1330744 Title: live migration is incorrectly comparing host cpu features Status in OpenStack Compute (Nova): New Bug description: Running Havana, we're seeing live migration fail when attempting to migrate from an Ivy-Bridge host to a Sandy-Bridge host. However, we're using the default kvm guest config which has a safe default virtual cpu with a subset of cpu features. /proc/cpuinfo from within the guest looks the same on both types of hosts. I think the problem is that when check_can_live_migrate_destination() calls _compare_cpu(), it's comparing the host CPUs. Instead, I think we should be comparing the guest CPU against the host CPU of the destination to make sure it's compatible. (Assuming that libvirt considers the qemu virtual cpu to be compatible with the host cpu.) To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1330744/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
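The comparison being suggested can be expressed directly with libvirt (a sketch using the python bindings; not the nova code): feed the guest CPU definition, not the source host's CPU, to the destination's compareCPU().

    import libvirt

    def guest_cpu_ok_on_dest(dest_uri, guest_cpu_xml):
        conn = libvirt.open(dest_uri)
        try:
            # IDENTICAL or SUPERSET means the destination host can run
            # this guest CPU model; INCOMPATIBLE means migration fails.
            result = conn.compareCPU(guest_cpu_xml, 0)
            return result > libvirt.VIR_CPU_COMPARE_INCOMPATIBLE
        finally:
            conn.close()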
[Yahoo-eng-team] [Bug 1313967] [NEW] build_and_run_instance() appears to be dead code
Public bug reported: In nova/compute/manager.py, the code path build_and_run_instance() -> do_build_and_run_instance() -> _build_and_run_instance() seems to be dead code, used by nothing but the unit tests. It looks like the code that is actually being used is run_instance(). ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1313967 Title: build_and_run_instance() appears to be dead code Status in OpenStack Compute (Nova): New Bug description: In nova/compute/manager.py, the code path build_and_run_instance() -> do_build_and_run_instance() -> _build_and_run_instance() seems to be dead code, used by nothing but the unit tests. It looks like the code that is actually being used is run_instance(). To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1313967/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1313967] Re: build_and_run_instance() appears to be dead code
Sorry for the noise, I started reading the code and realized that it was just taking a long time to transition over to the new function. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1313967 Title: build_and_run_instance() appears to be dead code Status in OpenStack Compute (Nova): Invalid Bug description: In nova/compute/manager.py, the code path build_and_run_instance() -> do_build_and_run_instance() -> _build_and_run_instance() seems to be dead code, used by nothing but the unit tests. It looks like the code that is actually being used is run_instance(). To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1313967/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1311793] [NEW] wrap_instance_event() swallows return codes
Public bug reported: In compute/manager.py the function wrap_instance_event() just calls function(). This means that if it's used to decorate a function that returns a value, then the caller will never see the return value. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: New ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1311793 Title: wrap_instance_event() swallows return codes Status in OpenStack Compute (Nova): New Bug description: In compute/manager.py the function wrap_instance_event() just calls function(). This means that if it's used to decorate a function that returns a value, then the caller will never see the return value. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1311793/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
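[Editorial sketch] A self-contained illustration of the problem and the one-line fix; the real decorator also records instance events, and these bodies are stripped down to show just the return-value handling:

    import functools

    def wrap_instance_event_buggy(function):
        @functools.wraps(function)
        def decorated_function(self, context, *args, **kwargs):
            function(self, context, *args, **kwargs)   # result discarded
        return decorated_function

    def wrap_instance_event_fixed(function):
        @functools.wraps(function)
        def decorated_function(self, context, *args, **kwargs):
            return function(self, context, *args, **kwargs)   # result propagated
        return decorated_function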
[Yahoo-eng-team] [Bug 1298494] [NEW] nova server-group-list doesn't show members of the group
Public bug reported: With current devstack I ensured I had GroupAntiAffinityFilter in scheduler_default_filters in /etc/nova/nova.conf, restarted nova-scheduler, then ran:

nova server-group-create --policy anti-affinity antiaffinitygroup
nova server-group-list

+--------------------------------------+-------------------+--------------------+---------+----------+
| Id                                   | Name              | Policies           | Members | Metadata |
+--------------------------------------+-------------------+--------------------+---------+----------+
| 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
+--------------------------------------+-------------------+--------------------+---------+----------+

nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=5d639349-1b77-43df-b13f-ed586e73b3ac cirros0
nova list

+--------------------------------------+---------+--------+------------+-------------+--------------------+
| ID                                   | Name    | Status | Task State | Power State | Networks           |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| a7a3ec40-85d9-4b72-a522-d1c0684f3ada | cirros0 | ACTIVE | -          | Running     | private=10.4.128.2 |
+--------------------------------------+---------+--------+------------+-------------+--------------------+

Then I tried listing the groups, and it didn't print the newly-booted instance as a member:

nova server-group-list

+--------------------------------------+-------------------+--------------------+---------+----------+
| Id                                   | Name              | Policies           | Members | Metadata |
+--------------------------------------+-------------------+--------------------+---------+----------+
| 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
+--------------------------------------+-------------------+--------------------+---------+----------+

Rerunning the nova command with --debug we see that the problem is in nova, not novaclient:

RESP BODY: {"server_groups": [{"members": [], "metadata": {}, "id": "5d639349-1b77-43df-b13f-ed586e73b3ac", "policies": ["anti-affinity"], "name": "antiaffinitygroup"}]}

Looking at the database, we see that the instance is actually tracked as a member of the group (along with two other instances that haven't been marked as deleted yet, which I think is also a bug).

mysql> select * from instance_group_member;
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
| created_at          | updated_at | deleted_at | deleted | id | instance_id                          | group_id |
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
| 2014-03-26 20:19:14 | NULL       | NULL       |       0 |  1 | d289502b-57fc-46f6-b39d-66a1db3a9ebc |        1 |
| 2014-03-26 20:25:04 | NULL       | NULL       |       0 |  2 | e07f1f15-4e93-4845-9203-bf928c196a78 |        1 |
| 2014-03-26 20:35:11 | NULL       | NULL       |       0 |  3 | a7a3ec40-85d9-4b72-a522-d1c0684f3ada |        1 |
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
3 rows in set (0.00 sec)

** Affects: nova Importance: Undecided Status: New ** Summary changed:
- nova instance-group-list doesn't show members of the group
+ nova server-group-list doesn't show members of the group
-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1298494 Title: nova server-group-list doesn't show members of the group Status in OpenStack Compute (Nova): New Bug description: With current devstack I ensured I had GroupAntiAffinityFilter in scheduler_default_filters in /etc/nova/nova.conf, restarted nova-scheduler, then ran:

nova server-group-create --policy anti-affinity antiaffinitygroup
nova server-group-list

+--------------------------------------+-------------------+--------------------+---------+----------+
| Id                                   | Name              | Policies           | Members | Metadata |
+--------------------------------------+-------------------+--------------------+---------+----------+
| 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
+--------------------------------------+-------------------+--------------------+---------+----------+

nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=5d639349-1b77-43df-b13f-ed586e73b3ac cirros0
nova list

+--------------------------------------+---------+--------+------------+-------------+--------------------+
| ID
[Yahoo-eng-team] [Bug 1298509] [NEW] nova server-group-delete allows deleting server group with members
Public bug reported: Currently nova will let you do this:

nova server-group-create --policy anti-affinity antiaffinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros1
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros2
nova server-group-delete group_uuid

Given that a server group is designed to logically group servers together, I don't think it makes sense to allow nova to delete a server group that currently has undeleted members in it. ** Affects: nova Importance: Undecided Status: New ** Tags: compute ** Tags added: compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1298509 Title: nova server-group-delete allows deleting server group with members Status in OpenStack Compute (Nova): New Bug description: Currently nova will let you do this:

nova server-group-create --policy anti-affinity antiaffinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros1
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros2
nova server-group-delete group_uuid

Given that a server group is designed to logically group servers together, I don't think it makes sense to allow nova to delete a server group that currently has undeleted members in it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1298509/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
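[Editorial sketch] A minimal version of the proposed guard, assuming a group record that carries its member list; the names here are illustrative, not nova's actual objects:

    class GroupInUse(Exception):
        pass

    def delete_server_group(group):
        # Refuse the delete while the group still has undeleted members.
        live = [m for m in group['members'] if not m.get('deleted')]
        if live:
            raise GroupInUse('server group %s still has %d undeleted member(s)'
                             % (group['id'], len(live)))
        # ...proceed with the actual delete...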
[Yahoo-eng-team] [Bug 1298513] [NEW] nova server group policy should be applied when resizing/migrating server
Public bug reported: If I do the following:

nova server-group-create --policy affinity affinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0
nova resize cirros0 2

The cirros0 server will be resized but when the scheduler runs it doesn't take into account the scheduling policy of the server group. I think the same would be true if we migrate the server. Lastly, when doing migration/evacuation and the user has specified the compute node we might want to validate the choice against the group policy. For emergencies we might want to allow policy violation with a --force option or something. ** Affects: nova Importance: Undecided Status: New ** Description changed:
- If we try to resize/migrate a server that is part of a server group, the
- server group policy should be applied when scheduling the server.
+ If I do the following:
+
+ nova server-group-create --policy affinity affinitygroup
+ nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0
+ nova resize cirros0 2
+
+ The cirros0 server will be resized but when the scheduler runs it
+ doesn't take into account the scheduling policy of the server group.
+
+ I think the same would be true if we migrate the server.
+
+ Lastly, when doing migration/evacuation and the user has specified the
+ compute node we might want to validate the choice against the group
+ policy. For emergencies we might want to allow policy violation with a
+ --force option or something.
-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1298513 Title: nova server group policy should be applied when resizing/migrating server Status in OpenStack Compute (Nova): New Bug description: If I do the following:

nova server-group-create --policy affinity affinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid cirros0
nova resize cirros0 2

The cirros0 server will be resized but when the scheduler runs it doesn't take into account the scheduling policy of the server group. I think the same would be true if we migrate the server. Lastly, when doing migration/evacuation and the user has specified the compute node we might want to validate the choice against the group policy. For emergencies we might want to allow policy violation with a --force option or something. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1298513/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
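[Editorial sketch] By way of illustration, applying the group policy at scheduling time could look like a simple host filter; the function and its inputs are assumptions for this sketch, not nova's real scheduler API:

    def hosts_satisfying_policy(candidate_hosts, policy, group_hosts):
        # group_hosts: hosts that already run a member of the server group
        group_hosts = set(group_hosts)
        if policy == 'affinity' and group_hosts:
            return [h for h in candidate_hosts if h in group_hosts]
        if policy == 'anti-affinity':
            return [h for h in candidate_hosts if h not in group_hosts]
        return list(candidate_hosts)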
[Yahoo-eng-team] [Bug 1298690] [NEW] sqlite regexp() function doesn't behave like mysql
Public bug reported: In bug 1298494 I recently saw a case where the unit tests (using sqlite) behaved differently than devstack with mysql. The issue seems to be when we do

filters = {'uuid': group.members, 'deleted_at': None}
instances = instance_obj.InstanceList.get_by_filters(
    context, filters=filters)

Eventually down in db/sqlalchemy/api.py we end up calling

query = query.filter(column_attr.op(db_regexp_op)(
    str(filters[filter_name])))

where str(filters[filter_name]) is the string 'None'. When using mysql, a regexp comparison of the string 'None' against a NULL field fails to match. Since sqlite doesn't have its own regexp function we provide one in openstack/common/db/sqlalchemy/session.py. In the buggy case we end up calling it as regexp('None', None), where the types are unicode and NoneType. However, we end up converting the second arg to text type before calling reg.search() on it, so it matches. This is a bug; we want the unit tests to behave like the real system. ** Affects: nova Importance: Undecided Status: New ** Tags: compute db ** Tags added: compute db -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1298690 Title: sqlite regexp() function doesn't behave like mysql Status in OpenStack Compute (Nova): New Bug description: In bug 1298494 I recently saw a case where the unit tests (using sqlite) behaved differently than devstack with mysql. The issue seems to be when we do

filters = {'uuid': group.members, 'deleted_at': None}
instances = instance_obj.InstanceList.get_by_filters(
    context, filters=filters)

Eventually down in db/sqlalchemy/api.py we end up calling

query = query.filter(column_attr.op(db_regexp_op)(
    str(filters[filter_name])))

where str(filters[filter_name]) is the string 'None'. When using mysql, a regexp comparison of the string 'None' against a NULL field fails to match. Since sqlite doesn't have its own regexp function we provide one in openstack/common/db/sqlalchemy/session.py. In the buggy case we end up calling it as regexp('None', None), where the types are unicode and NoneType. However, we end up converting the second arg to text type before calling reg.search() on it, so it matches. This is a bug; we want the unit tests to behave like the real system. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1298690/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
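[Editorial sketch] A runnable demonstration of the fix, assuming we simply make the sqlite helper treat a NULL value as a non-match the way MySQL does; the table and pattern below are illustrative:

    import re
    import sqlite3

    def regexp(pattern, value):
        # sqlite evaluates "X REGEXP Y" as regexp(Y, X); mirror MySQL,
        # where a REGEXP comparison against NULL never matches.
        if value is None:
            return False
        return re.search(pattern, str(value)) is not None

    con = sqlite3.connect(':memory:')
    con.create_function('regexp', 2, regexp)
    con.execute('CREATE TABLE t (deleted_at TEXT)')
    con.execute('INSERT INTO t VALUES (NULL)')
    rows = con.execute("SELECT * FROM t WHERE deleted_at REGEXP 'None'").fetchall()
    assert rows == []   # the old str() coercion made this row match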
[Yahoo-eng-team] [Bug 1296967] [NEW] instances stuck with task_state of REBOOTING after controller switchover
Public bug reported: We were doing some testing of Havana and have run into a scenario that ended up with two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING.

Currently when doing a reboot we set the REBOOTING task_state in the database in compute-api and then send an RPC cast. That seems awfully risky given that if that message gets lost or the call fails for any reason we could end up stuck in the REBOOTING state forever. I think it might make sense to have the power state audit clear the REBOOTING state if appropriate, but others with more experience should make that call. It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.

The timeline that I have looks like this. We had some buggy code that sent all the instances for a reboot when the controller came up. The first two are in the controller logs below, and these are the ones that failed.

controller: (running everything but nova-compute) nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1 status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1 status: 202 len: 185 time: 0.1583798

After this there are other reboot requests for the other instances, and those ones passed.
Interestingly, we later see this:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1 status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1 status: 409 len: 303 time: 0.0366399

Presumably the 409 responses are because nova thinks that these instances are currently rebooting.

compute:

2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO
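[Editorial sketch] One possible shape for the audit-side cleanup suggested above; the threshold and the dict-style field access are assumptions, and real code would go through the Instance object and save():

    import datetime

    REBOOT_STUCK_AFTER = datetime.timedelta(minutes=10)   # assumed bound

    def maybe_clear_stuck_reboot(instance, now=None):
        # Called from the periodic power-state audit: if an instance has
        # sat in REBOOTING longer than the bound, the RPC cast was likely
        # lost, so clear the task_state rather than wedging forever.
        now = now or datetime.datetime.utcnow()
        if (instance['task_state'] == 'rebooting'
                and now - instance['updated_at'] > REBOOT_STUCK_AFTER):
            instance['task_state'] = None
            return True
        return False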
[Yahoo-eng-team] [Bug 1296972] Re: RPC code in Havana doesn't handle connection errors
Looks like I misread that patch below, it's adding back the channel error check, not the connection error check. This may be due to a bad patch on our end, sorry for the noise. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1296972 Title: RPC code in Havana doesn't handle connection errors Status in OpenStack Compute (Nova): Invalid Bug description: We've got an HA controller setup using pacemaker and were stress-testing it by doing multiple controlled switchovers while doing other activity. Generally this works okay, but last night we ran into a problem where nova-compute got into a state where it was unable to reconnect with the AMQP server. Logs are at the bottom, they repeat every minute and did this for 7+ hours until the system was manually cleaned up. I've found something in the code that looks a bit suspicious. The "Unexpected exception occurred 61 time(s)... retrying." message comes from forever_retry_uncaught_exceptions() in excutils.py. It looks like we're raising "RecoverableConnectionError: connection already closed" down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but nothing handles it. It looks like the most likely place that should be handling it is nova.openstack.common.rpc.impl_kombu.Connection.ensure(). In the current oslo.messaging code the ensure() routine explicitly handles connection errors (which RecoverableConnectionError is) and socket timeouts--the ensure() routine in Havana doesn't do this. Maybe we should look at porting https://github.com/openstack/oslo.messaging/commit/0400cbf4f83cf8d58076c7e65e08a156ec3508a8 to the Havana RPC code? Logs showing the start of the problem and the first few iterations of the repeating issue:

2014-03-24 09:24:33.566 6620 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'sgw-4', vm_state=u'active', task_state=None, vcpus=2, cpuset=0x180, cpulist=[7, 8] pinned, nodelist=[0], node=0
2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'sgw-1', vm_state=u'active', task_state=None, vcpus=2, cpuset=0x60, cpulist=[5, 6] pinned, nodelist=[0], node=0
2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'load_balancer', vm_state=u'active', task_state=None, vcpus=3, cpuset=0x1c00, cpulist=[10, 11, 12] pinned, nodelist=[1], node=1
2014-03-24 09:24:34.182 6620 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 111290, per-node: [52286, 59304], numa nodes:2
2014-03-24 09:24:34.183 6620 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 29
2014-03-24 09:24:34.183 6620 AUDIT nova.compute.resource_tracker [-] Free vcpus: 170, free per-node float vcpus: [48, 112], free per-node pinned vcpus: [3, 7]
2014-03-24 09:24:34.183 6620 INFO nova.compute.resource_tracker [-] DETAIL: vcpus:20, Free vcpus:170, 16.0x overcommit, per-cpu float cpulist: [3, 4, 9, 13, 14, 15, 16, 17, 18, 19]
2014-03-24 09:24:34.244 6620 INFO nova.compute.resource_tracker [-] Compute_service record updated for compute-0:compute-0
2014-03-24 09:25:36.564 6620 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'sgw-4', vm_state=u'active', task_state=None, vcpus=2,
cpuset=0x180, cpulist=[7, 8] pinned, nodelist=[0], node=0
2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'sgw-1', vm_state=u'active', task_state=None, vcpus=2, cpuset=0x60, cpulist=[5, 6] pinned, nodelist=[0], node=0
2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: instance: name=u'load_balancer', vm_state=u'active', task_state=None, vcpus=3, cpuset=0x1c00, cpulist=[10, 11, 12] pinned, nodelist=[1], node=1
2014-03-24 09:25:37.182 6620 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 111290, per-node: [52286, 59304], numa nodes:2
2014-03-24 09:25:37.182 6620 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 29
2014-03-24 09:25:37.183 6620 AUDIT nova.compute.resource_tracker [-] Free vcpus: 170, free per-node float vcpus: [48, 112], free per-node pinned vcpus: [3, 7]
2014-03-24 09:25:37.183 6620 INFO nova.compute.resource_tracker [-] DETAIL: vcpus:20, Free vcpus:170, 16.0x overcommit, per-cpu float cpulist: [3, 4, 9, 13, 14, 15, 16, 17, 18, 19]
2014-03-24 09:25:37.245 6620 INFO nova.compute.resource_tracker [-] Compute_service record updated for compute-0:compute-0
2014-03-24 09:26:47.324 6620 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2014-03-24 09:26:47.324
[Yahoo-eng-team] [Bug 1294756] [NEW] missing test for None in sqlalchemy query filter
Public bug reported: In db.sqlalchemy.api.instance_get_all_by_filters() there is code that looks like this:

if not filters.pop('soft_deleted', False):
    query_prefix = query_prefix.\
        filter(models.Instance.vm_state != vm_states.SOFT_DELETED)

In sqlalchemy a comparison against a non-null value will not match null values, so the above filter will not return objects where vm_state is NULL. The problem is that in the Instance object the vm_state field is declared as nullable. In many cases vm_state will in fact have a value, but in get_test_instance() in test/utils.py the value of vm_state is not specified. Given the above, it seems that either we need to configure models.Instance.vm_state as not nullable (and deal with the fallout), or else we need to update instance_get_all_by_filters() to explicitly check for None--something like this perhaps:

if not filters.pop('soft_deleted', False):
    query_prefix = query_prefix.\
        filter(or_(models.Instance.vm_state != vm_states.SOFT_DELETED,
                   models.Instance.vm_state == None))

If we want to fix the query, I'll happily submit the updated code. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1294756 Title: missing test for None in sqlalchemy query filter Status in OpenStack Compute (Nova): New Bug description: In db.sqlalchemy.api.instance_get_all_by_filters() there is code that looks like this:

if not filters.pop('soft_deleted', False):
    query_prefix = query_prefix.\
        filter(models.Instance.vm_state != vm_states.SOFT_DELETED)

In sqlalchemy a comparison against a non-null value will not match null values, so the above filter will not return objects where vm_state is NULL. The problem is that in the Instance object the vm_state field is declared as nullable. In many cases vm_state will in fact have a value, but in get_test_instance() in test/utils.py the value of vm_state is not specified. Given the above, it seems that either we need to configure models.Instance.vm_state as not nullable (and deal with the fallout), or else we need to update instance_get_all_by_filters() to explicitly check for None--something like this perhaps:

if not filters.pop('soft_deleted', False):
    query_prefix = query_prefix.\
        filter(or_(models.Instance.vm_state != vm_states.SOFT_DELETED,
                   models.Instance.vm_state == None))

If we want to fix the query, I'll happily submit the updated code. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1294756/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
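[Editorial sketch] The NULL-comparison behaviour is easy to demonstrate in isolation. The following self-contained example (illustrative table, SQLAlchemy 1.4+ API) shows the unguarded filter silently dropping the NULL row while the or_() version keeps it:

    from sqlalchemy import Column, Integer, String, create_engine, or_
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Instance(Base):
        __tablename__ = 'instances'
        id = Column(Integer, primary_key=True)
        vm_state = Column(String, nullable=True)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add_all([Instance(vm_state='active'), Instance(vm_state=None)])
        session.commit()
        # "!=" against a non-null value never matches NULL rows:
        q1 = session.query(Instance).filter(Instance.vm_state != 'soft-deleted')
        assert len(q1.all()) == 1
        # The explicit None test keeps the NULL row:
        q2 = session.query(Instance).filter(
            or_(Instance.vm_state != 'soft-deleted',
                Instance.vm_state == None))  # noqa: E711
        assert len(q2.all()) == 2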
[Yahoo-eng-team] [Bug 1292963] [NEW] postgres incompatibility in InstanceGroup.get_hosts()
Public bug reported: When running InstanceGroup.get_hosts() on a Havana installation that uses postgres I get the following error:

RemoteError: Remote error: ProgrammingError (ProgrammingError) operator does not exist: timestamp without time zone ~ unknown
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] LINE 3: uuid != instances.uuid AND (instances.deleted_at ~ 'None') ...
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] ^
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

I'm not a database expert, but after doing some digging, it seems that the problem is this line in get_hosts():

filters = {'uuid': filter_uuids, 'deleted_at': None}

It seems that current postgres doesn't allow implicit casts. If I change the line to:

filters = {'uuid': filter_uuids, 'deleted': 0}

Then it seems to work. ** Affects: nova Importance: Undecided Assignee: Chris Friesen (cbf123) Status: In Progress ** Changed in: nova Assignee: (unassigned) => Chris Friesen (cbf123) ** Changed in: nova Status: New => In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1292963 Title: postgres incompatibility in InstanceGroup.get_hosts() Status in OpenStack Compute (Nova): In Progress Bug description: When running InstanceGroup.get_hosts() on a Havana installation that uses postgres I get the following error:

RemoteError: Remote error: ProgrammingError (ProgrammingError) operator does not exist: timestamp without time zone ~ unknown
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] LINE 3: uuid != instances.uuid AND (instances.deleted_at ~ 'None') ...
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] ^
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 83439206-3a88-495b-b6c7-6aea1287109f] HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

I'm not a database expert, but after doing some digging, it seems that the problem is this line in get_hosts():

filters = {'uuid': filter_uuids, 'deleted_at': None}

It seems that current postgres doesn't allow implicit casts. If I change the line to:

filters = {'uuid': filter_uuids, 'deleted': 0}

Then it seems to work. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1292963/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1289064] [NEW] live migration of instance should claim resources on target compute node
Public bug reported: I'm looking at the current Icehouse code, but this applies to previous versions as well. When we create a new instance via _build_instance() or _build_and_run_instance(), in both cases we call instance_claim() to test for resources and reserve them. During a cold migration we call prep_resize() which calls resize_claim() to reserve resources. However, when we live-migrate or evacuate an instance we don't do this. As far as I can see the current code will just spawn the new instance but the resource usage won't be updated until the audit runs at some unknown time in the future at which point it will add the new instance to self.tracked_instances and update the resource usage. This means that until the audit runs the scheduler has a stale view of system resources. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1289064 Title: live migration of instance should claim resources on target compute node Status in OpenStack Compute (Nova): New Bug description: I'm looking at the current Icehouse code, but this applies to previous versions as well. When we create a new instance via _build_instance() or _build_and_run_instance(), in both cases we call instance_claim() to test for resources and reserve them. During a cold migration we call prep_resize() which calls resize_claim() to reserve resources. However, when we live-migrate or evacuate an instance we don't do this. As far as I can see the current code will just spawn the new instance but the resource usage won't be updated until the audit runs at some unknown time in the future at which point it will add the new instance to self.tracked_instances and update the resource usage. This means that until the audit runs the scheduler has a stale view of system resources. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1289064/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
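[Editorial sketch] A purely illustrative sketch of the idea, mirroring what instance_claim()/resize_claim() do for boot and cold migration; the wrapper name and call signature are assumptions, not nova's actual code:

    import contextlib

    @contextlib.contextmanager
    def live_migration_claim(resource_tracker, context, instance):
        # Test for and reserve resources on the destination node before
        # spawning the migrated instance; release them if anything fails,
        # so the scheduler's view of the host stays accurate without
        # waiting for the periodic audit.
        claim = resource_tracker.instance_claim(context, instance)
        try:
            yield claim
        except Exception:
            claim.abort()
            raise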