[Yahoo-eng-team] [Bug 1825386] [NEW] nova is looking for OVMF file no longer provided by latest CentOS

2019-04-18 Thread Chris Friesen
Public bug reported:

In nova/virt/libvirt/driver.py the code looks for a hardcoded path
"/usr/share/OVMF/OVMF_CODE.fd".

It appears that CentOS 7.6 has modified the OVMF-20180508-3 rpm so that it no
longer contains this file.  Instead, the file now seems to be named
/usr/share/OVMF/OVMF_CODE.secboot.fd.

This will break the ability to boot guests using UEFI.
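
For illustration, here is a minimal sketch of one way the lookup could tolerate
both file names.  This is an assumption about a possible fix, not nova's actual
code; the candidate paths are the two mentioned in this report.

# Hedged sketch: probe a list of candidate OVMF firmware paths instead of a
# single hardcoded one.
import os

CANDIDATE_UEFI_LOADERS = (
    '/usr/share/OVMF/OVMF_CODE.fd',
    '/usr/share/OVMF/OVMF_CODE.secboot.fd',  # CentOS 7.6 OVMF-20180508-3
)

def find_uefi_loader():
    """Return the first OVMF loader present on this host, or None."""
    for path in CANDIDATE_UEFI_LOADERS:
        if os.path.exists(path):
            return path
    return None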

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1825386

Title:
  nova is looking for OVMF file no longer provided by latest CentOS

Status in OpenStack Compute (nova):
  New

Bug description:
  In nova/virt/libvirt/driver.py the code looks for a hardcoded path
  "/usr/share/OVMF/OVMF_CODE.fd".

  It appears that CentOS 7.6 has modified the OVMF-20180508-3 rpm so that it
  no longer contains this file.  Instead, the file now seems to be named
  /usr/share/OVMF/OVMF_CODE.secboot.fd.

  This will break the ability to boot guests using UEFI.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1825386/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1819216] [NEW] in devstack, "nova migrate " will try to migrate to the same host (and then fail)

2019-03-08 Thread Chris Friesen
Public bug reported:

In multinode devstack I had an instance running on one node and tried
running "nova migrate ".  The operation started, but then the
instance went into an error state with the following fault:

{"message": "Unable to migrate instance (2bbdab8e-
3a83-43a4-8c47-ce57b653e43e) to current host (fedora-1.novalocal).",
"code": 400, "created": "2019-03-08T19:59:09Z"}

Logically, I think that even if "resize to same host" is enabled, for a
"migrate" operation we should remove the current host from
consideration.  We know it's going to fail, and it doesn't make sense
anyways.

Also, it would probably make sense to make "migrate" work like "live
migration" which removes the current host from consideration.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

** Summary changed:

- in devstack, "nova migrate " can try to migrate to the same host
+ in devstack, "nova migrate " will try to migrate to the same host (and 
then fail)

** Description changed:

  In multinode devstack I had an instance running on one node and tried
  running "nova migrate ".  The operation started, but then the
  instance went into an error state with the following fault:
  
  {"message": "Unable to migrate instance (2bbdab8e-
  3a83-43a4-8c47-ce57b653e43e) to current host (fedora-1.novalocal).",
  "code": 400, "created": "2019-03-08T19:59:09Z"}
  
  Logically, I think that even if "resize to same host" is enabled, for a
  "migrate" operation we should remove the current host from
  consideration.  We know it's going to fail, and it doesn't make sense
  anyways.
+ 
+ Also, it would probably make sense to make "migrate" work like "live
+ migration" which removes the current host from consideration.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1819216

Title:
  in devstack, "nova migrate " will try to migrate to the same
  host (and then fail)

Status in OpenStack Compute (nova):
  New

Bug description:
  In multinode devstack I had an instance running on one node and tried
  running "nova migrate ".  The operation started, but then the
  instance went into an error state with the following fault:

  {"message": "Unable to migrate instance (2bbdab8e-
  3a83-43a4-8c47-ce57b653e43e) to current host (fedora-1.novalocal).",
  "code": 400, "created": "2019-03-08T19:59:09Z"}

  Logically, I think that even if "resize to same host" is enabled, for
  a "migrate" operation we should remove the current host from
  consideration.  We know it's going to fail, and it doesn't make sense
  anyways.

  Also, it would probably make sense to make "migrate" work like "live
  migration" which removes the current host from consideration.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1819216/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1818701] [NEW] invalid PCI alias in flavor results in HTTP 500 on instance create

2019-03-05 Thread Chris Friesen
Public bug reported:

If an invalid PCI alias is specified in the flavor extra-specs and we
try to create an instance with that flavor, it will result in a
PciInvalidAlias exception being raised.

In ServersController.create() PciInvalidAlias is missing from the list
of exceptions that get converted to an HTTPBadRequest.  Instead, it's
reported as a 500 error:

[stack@fedora-1 nova]$ nova boot --flavor  ds2G --image fedora29 --nic none 
--admin-pass fedora asdf3
ERROR (ClientException): Unexpected API Error. Please report this at 
http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
 (HTTP 500) (Request-ID: 
req-fec3face-4135-41fd-bc48-07957363ddae)
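
The actual fix is presumably just adding exception.PciInvalidAlias to the tuple
of exceptions that ServersController.create() already converts to
HTTPBadRequest.  As a self-contained illustration of that pattern (the names
below are stand-ins, not nova's exact code):

# Hedged sketch: translate a known "user error" exception into HTTP 400 at the
# API layer instead of letting it escape as an unexpected 500.
from webob import exc

class PciInvalidAlias(Exception):
    """Stand-in for nova.exception.PciInvalidAlias."""
    def format_message(self):
        return str(self)

def create_server(build_instance):
    try:
        return build_instance()
    except PciInvalidAlias as error:
        # Without a clause like this the exception surfaces as an HTTP 500.
        raise exc.HTTPBadRequest(explanation=error.format_message())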

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: api

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1818701

Title:
  invalid PCI alias in flavor results in HTTP 500 on instance create

Status in OpenStack Compute (nova):
  New

Bug description:
  If an invalid PCI alias is specified in the flavor extra-specs and we
  try to create an instance with that flavor, it will result in a
  PciInvalidAlias exception being raised.

  In ServersController.create() PciInvalidAlias is missing from the list
  of exceptions that get converted to an HTTPBadRequest.  Instead, it's
  reported as a 500 error:

  [stack@fedora-1 nova]$ nova boot --flavor  ds2G --image fedora29 --nic none 
--admin-pass fedora asdf3
  ERROR (ClientException): Unexpected API Error. Please report this at 
http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
   (HTTP 500) (Request-ID: 
req-fec3face-4135-41fd-bc48-07957363ddae)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1818701/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1818092] [NEW] hypervisor check in _check_instance_has_no_numa() is broken

2019-02-28 Thread Chris Friesen
Public bug reported:

In commit ae2e5650d "Fail to live migration if instance has a NUMA
topology" there is a check against hypervisor_type.

Unfortunately it tests against the value "obj_fields.HVType.KVM".  Even when
KVM is supported by qemu, the libvirt driver still reports the hypervisor type
as "QEMU", so we need to fix up the hypervisor type check; otherwise the check
will always fail.
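
A minimal sketch of the shape the corrected check needs (illustrative only;
the real code compares against obj_fields.HVType constants):

# Hedged sketch: accept the "QEMU" string that the libvirt driver reports even
# when the guest is actually accelerated by KVM.
def is_qemu_or_kvm(hypervisor_type):
    return hypervisor_type.lower() in ('qemu', 'kvm')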

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1818092

Title:
  hypervisor check in _check_instance_has_no_numa() is broken

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  In commit ae2e5650d "Fail to live migration if instance has a NUMA
  topology" there is a check against hypervisor_type.

  Unfortunately it tests against the value "obj_fields.HVType.KVM".  Even
  when KVM is supported by qemu, the libvirt driver still reports the
  hypervisor type as "QEMU", so we need to fix up the hypervisor type check;
  otherwise the check will always fail.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1818092/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1815949] [NEW] missing special-case libvirt exception during device detach

2019-02-14 Thread Chris Friesen
Public bug reported:

In Pike a customer has run into the following issue:

2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall [-] Dynamic 
interval looping call 'oslo_service.loopingcall._func' failed: libvirtError: 
internal error: unable to execute QEMU command 'device_del': Device 
'virtio-disk15' not found
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall Traceback (most 
recent call last):
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 143, in 
_run_loop
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall result = 
func(*self.args, **self.kw)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 363, in 
_func
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall result = 
f(*args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 505, in 
_do_wait_and_retry_detach
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
_try_detach_device(config, persistent=False, host=host)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 467, in 
_try_detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
device=alternative_device_name)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
self.force_reraise()
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
six.reraise(self.type_, self.value, self.tb)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 451, in 
_try_detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
self.detach_device(conf, persistent=persistent, live=live)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 530, in 
detach_device
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall 
self._domain.detachDeviceFlags(device_xml, flags=flags)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall result = 
proxy_call(self._autowrap, f, *args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall rv = 
execute(f, *args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall six.reraise(c, 
e, tb)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall rv = 
meth(*args, **kwargs)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall   File 
"/usr/lib64/python2.7/site-packages/libvirt.py", line 1217, in detachDeviceFlags
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall if ret == -1: 
raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall libvirtError: 
internal error: unable to execute QEMU command 'device_del': Device 
'virtio-disk15' not found
2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall


Based on discussion with Melanie Witt, it seems likely that nova is missing a 
special case in Guest.detach_device_with_retry(): we probably need to modify 
the conditional at line 409 of virt/libvirt/guest.py to look like 'if 
errcode in (libvirt.VIR_ERR_OPERATION_FAILED, libvirt.VIR_ERR_INTERNAL_ERROR):'
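
Pulled out as a standalone helper, the proposed check would look roughly like
this (an assumption about the shape of the fix, not merged nova code):

# Hedged sketch: treat VIR_ERR_INTERNAL_ERROR ("unable to execute QEMU command
# 'device_del': Device ... not found") like VIR_ERR_OPERATION_FAILED so the
# detach retry loop concludes the device is already gone.
import libvirt

def device_already_gone(ex):
    """ex: libvirt.libvirtError raised by detachDeviceFlags()."""
    return ex.get_error_code() in (libvirt.VIR_ERR_OPERATION_FAILED,
                                   libvirt.VIR_ERR_INTERNAL_ERROR)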

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1815949

Title:
  missing special-case libvirt exception during device detach

Status in OpenStack Compute (nova):
  New

Bug description:
  In Pike a customer has run into the following issue:

  2019-02-12 07:34:43.728 23425 ERROR oslo.service.loopingcall [-] Dynamic 
interval looping call 'oslo_service.loopingcall._func' failed: libvirtError: 

[Yahoo-eng-team] [Bug 1792985] [NEW] strict NUMA memory allocation for 4K pages leads to OOM-killer

2018-09-17 Thread Chris Friesen
Public bug reported:

We've seen a case on a resource-constrained compute node where booting
multiple instances passed, but led to the following error messages from
the host kernel:

[ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or 
sacrifice child
[ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, 
anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB

The problem appears to be that currently with libvirt an instance which
does not specify a NUMA topology (which implies "shared" CPUs and the
default memory pagesize) is allowed to float across the whole compute
node.  As such, we do not know which host NUMA node its memory is going
to be allocated from, and therefore we don't know how much memory is
remaining on each host NUMA node.

If we have a similar instance which *is* limited to a particular NUMA
node (due to adding a PCI device for example, or in the future by
specifying dedicated CPUs) then that allocation will currently use
"strict" NUMA affinity.  This allocation can fail if there isn't enough
memory available on that NUMA node (due to being "stolen" by a floating
instance, for example).

I think this means that we cannot use "strict" affinity for the default
page size even when we do have a numa_topology since we can't have
accurate per-NUMA-node accounting due to the fact that we don't know
which NUMA node floating instances allocated their memory from.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1792985

Title:
  strict NUMA memory allocation for 4K pages leads to OOM-killer

Status in OpenStack Compute (nova):
  New

Bug description:
  We've seen a case on a resource-constrained compute node where booting
  multiple instances passed, but led to the following error messages
  from the host kernel:

  [ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or 
sacrifice child
  [ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, 
anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB

  The problem appears to be that currently with libvirt an instance
  which does not specify a NUMA topology (which implies "shared" CPUs
  and the default memory pagesize) is allowed to float across the whole
  compute node.  As such, we do not know which host NUMA node its memory
  is going to be allocated from, and therefore we don't know how much
  memory is remaining on each host NUMA node.

  If we have a similar instance which *is* limited to a particular NUMA
  node (due to adding a PCI device for example, or in the future by
  specifying dedicated CPUs) then that allocation will currently use
  "strict" NUMA affinity.  This allocation can fail if there isn't
  enough memory available on that NUMA node (due to being "stolen" by a
  floating instance, for example).

  I think this means that we cannot use "strict" affinity for the
  default page size even when we do have a numa_topology since we can't
  have accurate per-NUMA-node accounting due to the fact that we don't
  know which NUMA node floating instances allocated their memory from.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1792985/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1792077] [NEW] problem specifying multiple "bus=scsi" block devices on nova boot

2018-09-11 Thread Chris Friesen
Public bug reported:

I'm using devstack stable/rocky on ubuntu 16.04.

When running this command

nova boot --flavor m1.small --nic net-name=public --block-device
source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove,bus=scsi
--block-device
source=blank,dest=volume,size=2,bootindex=1,shutdown=remove,bus=scsi
--poll twovol

the instance fails to boot with the error:

libvirtError: unsupported configuration: Found duplicate drive address
for disk with target name 'sda' controller='0' bus='0' target='0'
unit='0'


For some background information, this works:

nova boot --flavor m1.small --nic net-name=public --block-device
source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove,bus=scsi
--poll onevol

It also works if I have two block devices but don't specify "bus=scsi":

nova boot --flavor m1.small --nic net-name=public --block-device
source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove
--block-device
source=blank,dest=volume,size=2,bootindex=1,shutdown=remove --poll
twovolnoscsi

This maps to the following XML:

[libvirt disk XML not preserved here -- the element markup was stripped from
the log lines.  The two volumes are identifiable by their serials
f16cb93d-7bf0-4da7-a804-b9539d64576a and 7d5de2b0-cb66-4607-a5f5-60fd40db51c3.]

In the failure case, the nova-compute logs include the following
interesting bits.  Note the additional '' lines in the XML.

Sep 12 04:48:43 devstack nova-compute[3062]: ERROR
nova.virt.libvirt.guest [None req-a7c5f15c-1e44-4cd1-bf57-45b819676b20
admin admin] Error defining a guest with XML: 

[libvirt disk XML not preserved here -- the element markup was stripped from
the log lines.  The two volumes carried serials
08561cc0-5cf2-4eb7-a3f9-956f945e6c24 and 007fac3d-8800-4f45-9531-e3bab5c86a1e.]

Sep 12 04:48:43 devstack nova-compute[3062]: : libvirtError: unsupported 
configuration: Found duplicate drive address for disk with target name 'sda' 
controller='0' bus='0' target='0' unit='0'
Sep 12 04:48:43 devstack nova-compute[3062]: ERROR nova.virt.libvirt.driver 
[None req-a7c5f15c-1e44-4cd1-bf57-45b819676b20 admin admin] [instance: 
cf4f2c6f-7391-4a49-8f40-5e5cda98f78b] Failed to start libvirt guest: 
libvirtError: unsupported configuration: Found duplicate drive address for disk 
with target name 'sda' controller='0' bus='0' target='0' unit='0'

Here is the libvirtd log in the failure case:

2018-09-12 04:48:43.312+: 16889: error :
virDomainDefCheckDuplicateDriveAddresses:5747 : unsupported
configuration: Found duplicate drive address for disk with target name
'sda' controller='0' bus='0' target='0' unit='0'

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

** Description changed:

  I'm using devstack stable/rocky on ubuntu 16.04.
  
- When running the command "nova boot --flavor m1.small --nic net-
- name=public --block-device
+ When running this command
+ 
+ nova boot --flavor m1.small --nic net-name=public --block-device
  
source=image,id=24e8e922-2687-48b5-a895-3134a650e00f,dest=volume,size=2,bootindex=0,shutdown=remove,bus=scsi
  --block-device
  source=blank,dest=volume,size=2,bootindex=1,shutdown=remove,bus=scsi
- --poll twovol" the instance fails to boot with the error "libvirtError:
- unsupported configuration: Found duplicate drive address for disk with
- target name 'sda' controller='0' bus='0' target='0' unit='0'"
+ --poll twovol
+ 
+ the instance fails to boot with the error:
+ 
+ libvirtError: unsupported configuration: Found duplicate drive address
+ for disk with target name 'sda' controller='0' bus='0' target='0'
+ unit='0'
  
  
  For some background information, this works:
  
  nova boot --flavor m1.small --nic net-name=public --block-device
  

[Yahoo-eng-team] [Bug 1790195] [NEW] performance problems starting up nova process due to regex code

2018-08-31 Thread Chris Friesen
Public bug reported:

We noticed that nova process startup seems to take a long time.  It
looks like one major culprit is the regex code at
https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py

Sean K Mooney highlighted one possible culprit:

 i dont really like this 
https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py#L128-L142
 def _get_all_chars():
 for i in range(0xFFFF)
 yield six.unichr(i)
 so that is got to loop 65535 times
 *going too
 and we call the function 17 times
 so that 1.1 million callse to re.escape every time we load that 
module
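
To make the cost concrete, here is a small standalone reproduction of the
pattern described above (an illustration only, not the nova module itself;
nova uses six.unichr because it still supported Python 2 at the time):

# Hedged sketch: build a character-class string by escaping every BMP
# character one at a time, 17 times, mimicking what happens at module import.
import re
import timeit

def _get_all_chars():
    for i in range(0xFFFF):
        yield chr(i)

def build_char_class():
    # re.escape() is invoked once per character, so this is ~65535 calls.
    return ''.join(re.escape(c) for c in _get_all_chars())

print('17 builds took %.2f seconds' % timeit.timeit(build_char_class, number=17))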

** Affects: nova
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1790195

Title:
  performance problems starting up nova process due to regex code

Status in OpenStack Compute (nova):
  New

Bug description:
  We noticed that nova process startup seems to take a long time.  It
  looks like one major culprit is the regex code at
  
https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py

  Sean K Mooney highlighted one possible culprit:

   i dont really like this 
https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py#L128-L142
   def _get_all_chars():
   for i in range(0xFFFF)
   yield six.unichr(i)
   so that is got to loop 65535 times
   *going too
   and we call the function 17 times
   so that 1.1 million callse to re.escape every time we load 
that module

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1790195/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1785270] [NEW] allow confirmation of resize/migration for migrations in "confirming" status

2018-08-03 Thread Chris Friesen
Public bug reported:

Confirmation of a resize is an RPC operation.  If a compute node fails
after a migration has been put into the "confirming" status there is no
way to confirm it again, causing the state of the instance to get
"stuck".

In the case of confirm_resize(), I don't see any problem with allowing
us to retry by sending another confirm_resize message. On the target
compute node the actual confirmation is synchronized by instance.uuid,
so there should be no races, and it already handles the "migration is
already confirmed" case.


The proposed code change would look something like this:

 @check_instance_state(vm_state=[vm_states.RESIZED])
 def confirm_resize(self, context, instance, migration=None):
     """Confirms a migration/resize and deletes the 'old' instance."""
     elevated = context.elevated()
     # NOTE(melwitt): We're not checking quota here because there isn't a
     # change in resource usage when confirming a resize. Resource
     # consumption for resizes are written to the database by compute, so
     # a confirm resize is just a clean up of the migration objects and a
     # state change in compute.
     if migration is None:
-        migration = objects.Migration.get_by_instance_and_status(
-            elevated, instance.uuid, 'finished')
+        # Look for migrations in confirming state as well as finished to
+        # handle cases where the confirm did not complete (eg. because
+        # the compute node went away during the confirm).
+        for status in ('finished', 'confirming'):
+            try:
+                migration = objects.Migration.get_by_instance_and_status(
+                    elevated, instance.uuid, status)
+                break
+            except exception.MigrationNotFoundByStatus:
+                pass
+
+        if migration is None:
+            raise exception.MigrationNotFoundByStatus(
+                instance_id=instance.uuid, status='finished|confirming')

** Affects: nova
 Importance: Low
 Status: Triaged


** Tags: resize

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1785270

Title:
  allow confirmation of resize/migration for migrations in "confirming"
  status

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Confirmation of a resize is an RPC operation.  If a compute node fails
  after a migration has been put into the "confirming" status there is
  no way to confirm it again, causing the state of the instance to get
  "stuck".

  In the case of confirm_resize(), I don't see any problem with allowing
  us to retry by sending another confirm_resize message. On the target
  compute node the actual confirmation is synchronized by instance.uuid,
  so there should be no races, and it already handles the "migration is
  already confirmed" case.

  
  The proposed code change would look something like this:

   @check_instance_state(vm_state=[vm_states.RESIZED])
   def confirm_resize(self, context, instance, migration=None):
       """Confirms a migration/resize and deletes the 'old' instance."""
       elevated = context.elevated()
       # NOTE(melwitt): We're not checking quota here because there isn't a
       # change in resource usage when confirming a resize. Resource
       # consumption for resizes are written to the database by compute, so
       # a confirm resize is just a clean up of the migration objects and a
       # state change in compute.
       if migration is None:
  -        migration = objects.Migration.get_by_instance_and_status(
  -            elevated, instance.uuid, 'finished')
  +        # Look for migrations in confirming state as well as finished to
  +        # handle cases where the confirm did not complete (eg. because
  +        # the compute node went away during the confirm).
  +        for status in ('finished', 'confirming'):
  +            try:
  +                migration = objects.Migration.get_by_instance_and_status(
  +                    elevated, instance.uuid, status)
  +                break
  +            except exception.MigrationNotFoundByStatus:
  +                pass
  +
  +        if migration is None:
  +            raise exception.MigrationNotFoundByStatus(
  +                instance_id=instance.uuid, status='finished|confirming')

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1785270/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1785123] [NEW] UEFI NVRAM lost on cold migration or resize

2018-08-02 Thread Chris Friesen
Public bug reported:

If you boot a virtual instance with UEFI, the UEFI NVRAM is lost on a
cold migration.

The default storage for the virtual UEFI NVRAM is in
/var/lib/libvirt/qemu/nvram/, and the file is not being copied over on
cold migration.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1785123

Title:
  UEFI NVRAM lost on cold migration or resize

Status in OpenStack Compute (nova):
  New

Bug description:
  If you boot a virtual instance with UEFI, the UEFI NVRAM is lost on a
  cold migration.

  The default storage for the virtual UEFI NVRAM is in
  /var/lib/libvirt/qemu/nvram/, and the file is not being copied over on
  cold migration.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1785123/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1785086] [NEW] docs for RPC is out of date

2018-08-02 Thread Chris Friesen
Public bug reported:

The information in doc/source/reference/rpc.rst is stale and should
probably be updated or removed so that it doesn't confuse people.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: docs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1785086

Title:
  docs for RPC is out of date

Status in OpenStack Compute (nova):
  New

Bug description:
  The information in doc/source/reference/rpc.rst is stale and should
  probably be updated or removed so that it doesn't confuse people.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1785086/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1781643] [NEW] With remote storage, swap disk size changed after resize-revert

2018-07-13 Thread Chris Friesen
Public bug reported:


There seems to be an issue (discovered in Pike) where ceph-backed swap does not 
return to the original size if a resize operation is reverted.

Steps to reproduce:
1) Configure compute nodes to use remote ceph-backed storage for instances.
2) Launch a vm with ephemeral and swap disks.  (The swap disk will be 
RBD-backed.)
3) Resize vm to a new flavor with larger swap disk size.  The swap disk will be 
resized to the larger size.
4) Resize-revert to original flavor.
5) Check actual disk sizes from within the VM and from ceph directly.

Expected behaviour:
VM swap disk size should be reverted back to original size.

Actual behaviour:
VM swap disk remains at the larger size.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: ceph compute resize

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1781643

Title:
  With remote storage, swap disk size changed after resize-revert

Status in OpenStack Compute (nova):
  New

Bug description:
  
  There seems to be an issue (discovered in Pike) where ceph-backed swap does 
not return to the original size if a resize operation is reverted.

  Steps to reproduce:
  1) Configure compute nodes to use remote ceph-backed storage for instances.
  2) Launch a vm with ephemeral and swap disks.  (The swap disk will be 
RBD-backed.)
  3) Resize vm to a new flavor with larger swap disk size.  The swap disk will 
be resized to the larger size.
  4) Resize-revert to original flavor.
  5) Check actual disk sizes from within the VM and from ceph directly.

  Expected behaviour:
  VM swap disk size should be reverted back to original size.

  Actual behaviour:
  VM swap disk remains at the larger size.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1781643/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1538565] Re: Guest CPU does not support 1Gb hugepages with explicit models

2018-05-28 Thread Chris Friesen
The code at https://review.openstack.org/#/c/534384/ has been merged,
and should allow the operator to explicitly add the pdpe1gb flag.

Marking as fixed.

** Changed in: nova
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1538565

Title:
  Guest CPU does not support 1Gb hugepages with explicit models

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  The CPU flag pdpe1gb indicates that the CPU model supports 1 GB
  hugepages - without it, the Linux operating system refuses to allocate
  1 GB huge pages (and other things might go wrong if it did).

  Not all Intel CPU models support 1 GB huge pages, so the qemu options
  -cpu Haswell and -cpu Broadwell give you a vCPU that does not have the
  pdpe1gb flag. This is the correct thing to do, since the VM might be
  running on a Haswell that does not have 1GB huge pages.

  Problem is that Nova flavor extra specs with the libvirt driver for
  qemu/kvm only allow to define the CPU model, either an explicit model
  or "host". The host option means that all CPU flags in the host CPU
  are passed to the vCPU. However, the host option complicates VM
  migration since the CPU would change after migration.

  In conclusion, there is no good way to specify a CPU model that would
  imply the pdpe1gb flag.

  Huge pages are used eg with dpdk. They improve the performance of the
  VM mainly by reducing tlb size.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1538565/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1764556] Re: "nova list" fails with exception.ServiceNotFound if service is deleted and has no UUID

2018-04-18 Thread Chris Friesen
I think we could get into the bad state described in the bug if we do a
slightly different series of actions:

1) boot instance on Ocata
2) migrate instance 
3) delete compute node (thus deleting the service record) 
4) create compute node with same name 
5) migrate instance to newly-created compute node 
6) upgrade to Pike

This should result in the deleted service not having a UUID, which will
cause problems in Pike if we do a "nova list".


I suppose an argument could be made that this is an unlikely scenario, which is 
probably true. :)

** Changed in: nova
   Status: Fix Released => New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1764556

Title:
  "nova list" fails with exception.ServiceNotFound if service is deleted
  and has no UUID

Status in OpenStack Compute (nova):
  New

Bug description:
  We had a testcase where we booted an instance on Newton, migrated it
  off the compute node, deleted the compute node (and service), upgraded
  to Pike, created a new compute node with the same name, and migrated
  the instance back to the compute node.

  At this point the "nova list" command failed with
  exception.ServiceNotFound.

  It appears that since the Service has no UUID the _from_db_object()
  routine will try to add it, but the service.save() call fails because
  the service in question has been deleted.

  I reproduced the issue with stable/pike devstack.  I booted an
  instance, then created a fake entry in the "services" table without a
  UUID so the table looked like this:

  mysql>  select * from services;
  
  (table reflowed for readability; one service record per block)

  id: 1   host: devstack   binary: nova-conductor   topic: conductor
      created_at: 2018-02-20 16:10:07   updated_at: 2018-04-16 22:10:46
      deleted_at: NULL   report_count: 477364   disabled: 0   deleted: 0
      disabled_reason: NULL   last_seen_up: 2018-04-16 22:10:46
      forced_down: 0   version: 22
      uuid: c041d7cf-5047-4014-b50c-3ba6b5d95097

  id: 2   host: devstack   binary: nova-compute   topic: compute
      created_at: 2018-02-20 16:10:10   updated_at: 2018-04-16 22:10:54
      deleted_at: NULL   report_count: 477149   disabled: 0   deleted: 0
      disabled_reason: NULL   last_seen_up: 2018-04-16 22:10:54
      forced_down: 0   version: 22
      uuid: d0cfb63c-8b59-4b65-bb7e-6b89acd3fe35

  id: 3   host: devstack   binary: nova-compute   topic: compute
      created_at: 2018-02-20 16:10:10   updated_at: 2018-04-16 20:29:33
      deleted_at: 2018-04-16 20:30:33   report_count: 476432   disabled: 0
      deleted: 3   disabled_reason: NULL   last_seen_up: 2018-04-16 20:30:33
      forced_down: 0   version: 22
      uuid: NULL


  At this point, running "nova show " worked fine, but running
  "nova list" failed:

  stack@devstack:~/devstack$ nova list
  ERROR (ClientException): Unexpected API Error. Please report this at 
http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
   (HTTP 500) (Request-ID: 
req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6)

  
  The nova-api log looked like this:

  Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG nova.compute.api 
[None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Listing 1000 
instances in cell 09eb515f-9906-40bf-9be6-63b5e6ee279a(cell1) {{(pid=4261) 
_get_instances_by_filters_all_cells /opt/stack/nova/nova/compute/api.py:2559}}
  Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 
oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo 
demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" acquired by 
"nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s 
{{(pid=4261) inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270}}
  Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 
oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo 
demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" released by 
"nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s 
{{(pid=4261) inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282}}
  Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 

[Yahoo-eng-team] [Bug 1764556] [NEW] "nova list" fails with exception.ServiceNotFound if service is deleted and has no UUID

2018-04-16 Thread Chris Friesen
Public bug reported:

We had a testcase where we booted an instance on Newton, migrated it off
the compute node, deleted the compute node (and service), upgraded to
Pike, created a new compute node with the same name, and migrated the
instance back to the compute node.

At this point the "nova list" command failed with
exception.ServiceNotFound.

It appears that since the Service has no UUID the _from_db_object()
routine will try to add it, but the service.save() call fails because
the service in question has been deleted.

I reproduced the issue with stable/pike devstack.  I booted an instance,
then created a fake entry in the "services" table without a UUID so the
table looked like this:

mysql>  select * from services;
(table reflowed for readability; one service record per block)

id: 1   host: devstack   binary: nova-conductor   topic: conductor
    created_at: 2018-02-20 16:10:07   updated_at: 2018-04-16 22:10:46
    deleted_at: NULL   report_count: 477364   disabled: 0   deleted: 0
    disabled_reason: NULL   last_seen_up: 2018-04-16 22:10:46
    forced_down: 0   version: 22
    uuid: c041d7cf-5047-4014-b50c-3ba6b5d95097

id: 2   host: devstack   binary: nova-compute   topic: compute
    created_at: 2018-02-20 16:10:10   updated_at: 2018-04-16 22:10:54
    deleted_at: NULL   report_count: 477149   disabled: 0   deleted: 0
    disabled_reason: NULL   last_seen_up: 2018-04-16 22:10:54
    forced_down: 0   version: 22
    uuid: d0cfb63c-8b59-4b65-bb7e-6b89acd3fe35

id: 3   host: devstack   binary: nova-compute   topic: compute
    created_at: 2018-02-20 16:10:10   updated_at: 2018-04-16 20:29:33
    deleted_at: 2018-04-16 20:30:33   report_count: 476432   disabled: 0
    deleted: 3   disabled_reason: NULL   last_seen_up: 2018-04-16 20:30:33
    forced_down: 0   version: 22
    uuid: NULL


At this point, running "nova show " worked fine, but running "nova
list" failed:

stack@devstack:~/devstack$ nova list
ERROR (ClientException): Unexpected API Error. Please report this at 
http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
 (HTTP 500) (Request-ID: 
req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6)


The nova-api log looked like this:

Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG nova.compute.api 
[None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] Listing 1000 
instances in cell 09eb515f-9906-40bf-9be6-63b5e6ee279a(cell1) {{(pid=4261) 
_get_instances_by_filters_all_cells /opt/stack/nova/nova/compute/api.py:2559}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 
oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo 
demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" acquired by 
"nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s 
{{(pid=4261) inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 
oslo_concurrency.lockutils [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo 
demo] Lock "09eb515f-9906-40bf-9be6-63b5e6ee279a" released by 
"nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s 
{{(pid=4261) inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: DEBUG 
nova.objects.service [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 demo demo] 
Generated UUID 4368a7ff-f589-4197-b0b9-d2afdb71ca33 for service 3 {{(pid=4261) 
_from_db_object /opt/stack/nova/nova/objects/service.py:245}}
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR 
nova.api.openstack.extensions [None req-b7e1b5f9-e7b4-4ccf-ba28-e8b3e1acd2f6 
demo demo] Unexpected exception in API method: ServiceNotFound: Service 3 could 
not be found.
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR 
nova.api.openstack.extensions Traceback (most recent call last):
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR 
nova.api.openstack.extensions   File 
"/opt/stack/nova/nova/api/openstack/extensions.py", line 336, in wrapped
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR 
nova.api.openstack.extensions return f(*args, **kwargs)
Apr 16 22:11:00 devstack devstack@n-api.service[4258]: ERROR 
nova.api.openstack.extensions   File 

[Yahoo-eng-team] [Bug 1763766] [NEW] nova needs to disallow topology changes on image rebuild

2018-04-13 Thread Chris Friesen
Public bug reported:

When doing a rebuild the assumption throughout the code is that we are
not changing the resources consumed by the guest (that is what a resize
is for).  The complication here is that there are a number of image
properties which might affect the instance resource consumption (in
conjunction with a suitable flavor):

hw_numa_nodes=X
hw_numa_cpus.X=Y
hw_numa_mem.X=Y
hw_mem_page_size=X
hw_cpu_thread_policy=X
hw_cpu_policy=X

Due to the assumptions made in the rest of the code, we need to add a
check to ensure that on a rebuild the above image properties do not
differ between the old and new images.


While they might look suspicious, I think that the following image properties 
*should* be allowed to differ, since they only affect the topology seen by the 
guest:

hw_cpu_threads
hw_cpu_cores
hw_cpu_sockets
hw_cpu_max_threads
hw_cpu_max_cores
hw_cpu_max_sockets
hw_cpu_realtime_mask
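
A minimal sketch of such a check follows (a hypothetical helper, not nova's
implementation; the per-NUMA-node keys are matched by prefix because they
carry a node index):

# Hedged sketch: refuse a rebuild whose new image changes any
# resource-affecting image property relative to the old image.
RESOURCE_PROPS = ('hw_numa_nodes', 'hw_mem_page_size',
                  'hw_cpu_thread_policy', 'hw_cpu_policy')
RESOURCE_PREFIXES = ('hw_numa_cpus.', 'hw_numa_mem.')

def rebuild_changes_resources(old_props, new_props):
    """old_props/new_props: dicts of image properties (metadata)."""
    def relevant(props):
        return {k: v for k, v in props.items()
                if k in RESOURCE_PROPS or k.startswith(RESOURCE_PREFIXES)}
    return relevant(old_props) != relevant(new_props)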

** Affects: nova
 Importance: Medium
 Status: Triaged

** Affects: nova/ocata
 Importance: Medium
 Status: Confirmed

** Affects: nova/pike
 Importance: Medium
 Status: Confirmed

** Affects: nova/queens
 Importance: Medium
 Status: Confirmed


** Tags: compute rebuild

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1763766

Title:
  nova needs to disallow topology changes on image rebuild

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) ocata series:
  Confirmed
Status in OpenStack Compute (nova) pike series:
  Confirmed
Status in OpenStack Compute (nova) queens series:
  Confirmed

Bug description:
  When doing a rebuild the assumption throughout the code is that we are
  not changing the resources consumed by the guest (that is what a
  resize is for).  The complication here is that there are a number of
  image properties which might affect the instance resource consumption
  (in conjunction with a suitable flavor):

  hw_numa_nodes=X
  hw_numa_cpus.X=Y
  hw_numa_mem.X=Y
  hw_mem_page_size=X
  hw_cpu_thread_policy=X
  hw_cpu_policy=X

  Due to the assumptions made in the rest of the code, we need to add a
  check to ensure that on a rebuild the above image properties do not
  differ between the old and new images.

  
  While they might look suspicious, I think that the following image properties 
*should* be allowed to differ, since they only affect the topology seen by the 
guest:

  hw_cpu_threads
  hw_cpu_cores
  hw_cpu_sockets
  hw_cpu_max_threads
  hw_cpu_max_cores
  hw_cpu_max_sockets
  hw_cpu_realtime_mask

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1763766/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1552777] Re: resizing from flavor with swap to one without swap puts instance into Error status

2018-03-21 Thread Chris Friesen
** Changed in: nova
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1552777

Title:
  resizing from flavor with swap to one without swap puts instance into
  Error status

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In a single-node devstack (current trunk, nova commit 6e1051b7), if
  you boot an instance with a flavor that has nonzero swap and then
  resize to a flavor with zero swap it causes an exception.  It seems
  that we somehow neglect to remove the swap file from the instance.

   2016-03-03 10:02:41.415 ERROR nova.virt.libvirt.guest 
[req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] Error launching a 
defined domain with XML: 
  [libvirt domain XML not preserved here -- the element markup was stripped
  from the log.  Recoverable values: domain name instance-0001, UUID
  54711b56-fa72-4eac-a5d3-aa29ed128098, display name "asdf", created
  2016-03-03 16:02:39, 512 MB memory, 1 vCPU, owner admin/alt_demo, kernel and
  ramdisk under /opt/stack/data/nova/instances/54711b56-fa72-4eac-a5d3-aa29ed128098/,
  kernel command line "root=/dev/vda console=tty0 console=ttyS0", emulator
  /usr/bin/kvm-spice.]
  2016-03-03 10:02:41.417 ERROR nova.compute.manager 
[req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] Setting instance vm_state to ERROR
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] Traceback (most recent call last):
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3999, in finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] disk_info, image_meta)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3964, in _finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] old_instance_type)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in 
__exit__
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] six.reraise(self.type_, self.value, 
self.tb)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3959, in _finish_resize
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] block_device_info, power_on)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 7202, in finish_migration
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] vifs_already_plugged=True)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 4862, in 
_create_domain_and_network
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] xml, pause=pause, power_on=power_on)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 4793, in _create_domain
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] guest.launch(pause=pause)
  2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   

[Yahoo-eng-team] [Bug 1756179] [NEW] deleting a nova-compute service leaves orphaned records in placement and host mapping

2018-03-15 Thread Chris Friesen
Public bug reported:

Currently when deleting a nova-compute service via the API, we will
delete the service and compute_node records in the DB, but the placement
resource provider and host mapping records will be orphaned.

The orphaned resource provider records have been found to cause
scheduler failures if you re-create the compute node with the same name
(but a different UUID).  It has been theorized that the stale host
mapping records could end up pointing at the wrong cell.

In discussions on IRC (http://eavesdrop.openstack.org/irclogs
/%23openstack-nova/%23openstack-
nova.2018-03-15.log.html#t2018-03-15T19:30:13) it was proposed that we
should

1. delete the RP in placement
2. delete the host mapping
3. delete the service/node

Optionally we could delete the compute node prior to deleting the
service to make it explicit and because the ordering is slightly more
logical, but this is not a requirement since it will be done implicitly
as part of deleting the service.

** Affects: nova
 Importance: Medium
 Status: Triaged

** Affects: nova/pike
 Importance: Undecided
 Status: New

** Affects: nova/queens
 Importance: Undecided
 Status: New


** Tags: api cells placement

** Summary changed:

- deleting a nova-compute service leaves orphaned records in placement
+ deleting a nova-compute service leaves orphaned records in placement and host 
mapping

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1756179

Title:
  deleting a nova-compute service leaves orphaned records in placement
  and host mapping

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) pike series:
  New
Status in OpenStack Compute (nova) queens series:
  New

Bug description:
  Currently when deleting a nova-compute service via the API, we will
  delete the service and compute_node records in the DB, but the
  placement resource provider and host mapping records will be orphaned.

  The orphaned resource provider records have been found to cause
  scheduler failures if you re-create the compute node with the same
  name (but a different UUID).  It has been theorized that the stale
  host mapping records could end up pointing at the wrong cell.

  In discussions on IRC (http://eavesdrop.openstack.org/irclogs
  /%23openstack-nova/%23openstack-
  nova.2018-03-15.log.html#t2018-03-15T19:30:13) it was proposed that we
  should

  1. delete the RP in placement
  2. delete the host mapping
  3. delete the service/node

  Optionally we could delete the compute node prior to deleting the
  service to make it explicit and because the ordering is slightly more
  logical, but this is not a requirement since it will be done
  implicitly as part of deleting the service.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1756179/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1755981] [NEW] powering off and on an instance can result in instance boot failure due to serial port handling race

2018-03-14 Thread Chris Friesen
Public bug reported:

The following is specific to the libvirt driver.

When we call power_off() it calls _destroy(), which in turn calls
self._get_serial_ports_from_guest() and loops over all the serial ports
calling serial_console.release_port() on each.  This removes the host
TCP port from ALLOCATED_PORTS (which is the set of allocated ports on
the host).

Then when we call power_on(), it again calls _destroy(), which again
calls self._get_serial_ports_from_guest().  This will return the same
set of ports that it did before.  This is a problem, because those ports
could have been allocated to another instance in the meantime!

So in the case where one or more of those ports had been allocated to
another instance, we call serial_console.release_port() on them, and
remove them from ALLOCATED_PORTS.

Then as part of power_on() we will create new XML with new serial ports,
which could select the ports that we just removed from ALLOCATED_PORTS
(which are actually in use by another instance).  When qemu tries to
bind to this port it will fail, causing the instance to error out and
stay in the SHUTOFF state.

One possible solution would be to call guest.detach_device() on the
"serial" and "console" devices from the guest in the power_off()
routine.  That way when we call _destroy() in the power_on() routine
there wouldn't be any devices returned by
_get_serial_ports_from_guest().  This is a bit messy though, so if
anyone has any better ideas I'd like to hear about it.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1755981

Title:
  powering off and on an instance can result in instance boot failure
  due to serial port handling race

Status in OpenStack Compute (nova):
  New

Bug description:
  The following is specific to the libvirt driver.

  When we call power_off() it calls _destroy(), which in turn calls
  self._get_serial_ports_from_guest() and loops over all the serial
  ports calling serial_console.release_port() on each.  This removes the
  host TCP port from ALLOCATED_PORTS (which is the set of allocated
  ports on the host).

  Then when we call power_on(), it again calls _destroy(), which again
  calls self._get_serial_ports_from_guest().  This will return the same
  set of ports that it did before.  This is a problem, because those
  ports could have been allocated to another instance in the meantime!

  So in the case where one or more of those ports had been allocated to
  another instance, we call serial_console.release_port() on them, and
  remove them from ALLOCATED_PORTS.

  Then as part of power_on() we will create new XML with new serial
  ports, which could select the ports that we just removed from
  ALLOCATED_PORTS (which are actually in use by another instance).  When
  qemu tries to bind to this port it will fail, causing the instance to
  error out and stay in the SHUTOFF state.

  One possible solution would be to call guest.detach_device() on the
  "serial" and "console" devices from the guest in the power_off()
  routine.  That way when we call _destroy() in the power_on() routine
  there wouldn't be any devices returned by
  _get_serial_ports_from_guest().  This is a bit messy though, so if
  anyone has any better ideas I'd like to hear about it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1755981/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1754782] [NEW] we skip critical scheduler filters when forcing the host on instance boot

2018-03-09 Thread Chris Friesen
Public bug reported:

When booting an instance it's possible to force it to be placed on a
specific host using the  "--availability-zone nova:host" syntax.

If you do this, the code at 
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L581
 will return early rather than call self.filter_handler.get_filtered_objects()

Based on discussions at the PTG with Dan Smith, the simplest solution
would be to create a flag similar to RUN_ON_REBUILD which would be
applied to the various scheduler filters in a manner analogous to how
rebuild is handled now.

Presumably we'd want to call something like this during the instance
boot code to ensure we hit the existing "if not check_type" at L581:

request_spec.scheduler_hints['_nova_check_type'] = ['build']


Then in the various critical filters (NUMATopologyFilter for example, and
PciPassthroughFilter, and maybe some others like ComputeFilter) we could define
something like "RUN_ON_BUILD = True" to ensure that they run even when forcing
a host.
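
A rough sketch of what that could look like (the RUN_ON_BUILD attribute and
the 'build' check type are hypothetical, mirroring how RUN_ON_REBUILD is
handled today):

    class BaseHostFilter(object):
        RUN_ON_BUILD = False   # hypothetical default: skip when host is forced

    class NUMATopologyFilter(BaseHostFilter):
        RUN_ON_BUILD = True    # must run even when the host is forced

    class PciPassthroughFilter(BaseHostFilter):
        RUN_ON_BUILD = True

    def filters_to_run(filters, check_type):
        if check_type == 'build':
            return [f for f in filters if f.RUN_ON_BUILD]
        return filters

    print([f.__name__ for f in
           filters_to_run([NUMATopologyFilter, PciPassthroughFilter], 'build')])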

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1754782

Title:
  we skip critical scheduler filters when forcing the host on instance
  boot

Status in OpenStack Compute (nova):
  New

Bug description:
  When booting an instance it's possible to force it to be placed on a
  specific host using the  "--availability-zone nova:host" syntax.

  If you do this, the code at 
  
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L581
 will return early rather than call self.filter_handler.get_filtered_objects()

  Based on discussions at the PTG with Dan Smith, the simplest solution
  would be to create a flag similar to RUN_ON_REBUILD which would be
  applied to the various scheduler filters in a manner analogous to how
  rebuild is handled now.

  Presumably we'd want to call something like this during the instance
  boot code to ensure we hit the existing "if not check_type" at L581:

  request_spec.scheduler_hints['_nova_check_type'] = ['build']

  
  Then in the various critical filters (NUMATopologyFilter for example, and
  PciPassthroughFilter, and maybe some others like ComputeFilter) we could define
  something like "RUN_ON_BUILD = True" to ensure that they run even when forcing
  a host.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1754782/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1538565] Re: Guest CPU does not support 1Gb hugepages with explicit models

2018-03-01 Thread Chris Friesen
In recent versions of qemu the "Skylake-Server" cpu model has the flag,
but any earlier Intel processor models do not.

** Changed in: nova
   Status: Expired => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1538565

Title:
  Guest CPU does not support 1Gb hugepages with explicit models

Status in OpenStack Compute (nova):
  Confirmed

Bug description:
  The CPU flag pdpe1gb indicates that the CPU model supports 1 GB
  hugepages - without it, the Linux operating system refuses to allocate
  1 GB huge pages (and other things might go wrong if it did).

  Not all Intel CPU models support 1 GB huge pages, so the qemu options
  -cpu Haswell and -cpu Broadwell give you a vCPU that does not have the
  pdpe1gb flag. This is the correct thing to do, since the VM might be
  running on a Haswell that does not have 1GB huge pages.

  Problem is that Nova flavor extra specs with the libvirt driver for
  qemu/kvm only allow to define the CPU model, either an explicit model
  or "host". The host option means that all CPU flags in the host CPU
  are passed to the vCPU. However, the host option complicates VM
  migration since the CPU would change after migration.

  In conclusion, there is no good way to specify a CPU model that would
  imply the pdpe1gb flag.

  Huge pages are used eg with dpdk. They improve the performance of the
  VM mainly by reducing tlb size.
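
  As a quick host-side sanity check (illustrative Python, not nova code), one
  can verify whether the physical CPU advertises the flag at all; newer nova
  releases also grew a [libvirt]/cpu_model_extra_flags option that can add
  pdpe1gb on top of an explicit model, though availability depends on the
  release:

    # Does this host CPU advertise 1 GB page support?  (Linux only.)
    with open('/proc/cpuinfo') as cpuinfo:
        flags_line = next(line for line in cpuinfo if line.startswith('flags'))
    print('pdpe1gb' in flags_line.split())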

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1538565/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1750623] [NEW] rebuild to same host with different image shouldn't check with placement

2018-02-20 Thread Chris Friesen
Public bug reported:

When doing a rebuild-to-same-host but with a different image, all we
really want to do is ensure that the image properties for the new image
are still valid for the current host.  Accordingly we need to go through
the scheduler (to run the image-related filters) but we don't want to do
anything related to resource consumption.

Currently the scheduler will contact placement to get a pre-filtered
list of compute nodes with sufficient free resources for the instance in
question.  If the instance is on a compute node that is close to full,
this may result in the current compute node being filtered out of the
list, which will result in a noValidHost exception.

Ideally, in the case where we are doing a rebuild-to-same-host we would
simply retrieve the information for the current compute node from the DB
instead of from placement, and then run the image-related scheduler
filters.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: rebuild scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1750623

Title:
  rebuild to same host with different image shouldn't check with
  placement

Status in OpenStack Compute (nova):
  New

Bug description:
  When doing a rebuild-to-same-host but with a different image, all we
  really want to do is ensure that the image properties for the new
  image are still valid for the current host.  Accordingly we need to go
  through the scheduler (to run the image-related filters) but we don't
  want to do anything related to resource consumption.

  Currently the scheduler will contact placement to get a pre-filtered
  list of compute nodes with sufficient free resources for the instance
  in question.  If the instance is on a compute node that is close to
  full, this may result in the current compute node being filtered out
  of the list, which will result in a noValidHost exception.

  Ideally, in the case where we are doing a rebuild-to-same-host we
  would simply retrieve the information for the current compute node
  from the DB instead of from placement, and then run the image-related
  scheduler filters.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1750623/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1750618] [NEW] rebuild to same host with a different image results in erroneously doing a Claim

2018-02-20 Thread Chris Friesen
Public bug reported:

As of stable/pike if we do a rebuild-to-same-node with a new image, it
results in ComputeManager.rebuild_instance() being called with
"scheduled_node=" and "recreate=False".  This results in a new
Claim, which seems wrong since we're not changing the flavor and that
claim could fail if the compute node is already full.

The comments in ComputeManager.rebuild_instance() make it appear that it
expects both "recreate" and "scheduled_node" to be None for the rebuild-
to-same-host case; otherwise it will do a Claim.  However, if we rebuild
to a different image it ends up going through the scheduler, which means
that "scheduled_node" is not None.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute rebuild

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1750618

Title:
  rebuild to same host with a different image results in erroneously
  doing a Claim

Status in OpenStack Compute (nova):
  New

Bug description:
  As of stable/pike if we do a rebuild-to-same-node with a new image, it
  results in ComputeManager.rebuild_instance() being called with
  "scheduled_node=" and "recreate=False".  This results in a
  new Claim, which seems wrong since we're not changing the flavor and
  that claim could fail if the compute node is already full.

  The comments in ComputeManager.rebuild_instance() make it appear that
  it expects both "recreate" and "scheduled_node" to be None for the
  rebuild-to-same-host case; otherwise it will do a Claim.  However, if
  we rebuild to a different image it ends up going through the scheduler,
  which means that "scheduled_node" is not None.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1750618/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1605098] Re: Nova usage not showing server real uptime

2018-01-10 Thread Chris Friesen
Nova reserves resources for the instance even if it's not running, so
the reported uptime probably shouldn't be used for billing.

Also, the uptime gets reset on a resize/revert-resize/rescue, further
making it tricky to use for billing.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1605098

Title:
  Nova usage not showing server real uptime

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Hi All,

  I am trying to calculate openstack server "uptime" where nova os usage
  is giving server creation time, which cant take forward for billing,
  Is there any way to do ?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1605098/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1734394] [NEW] nova microversion 2.36 accidentally removed support for "force" when setting quotas

2017-11-24 Thread Chris Friesen
Public bug reported:

It is supposed to be possible to specify the "force" option when
updating a quota-set.  Up to microversion 2.35 this works as expected.

However, in 2.36 it no longer works, and nova-api sends back:

RESP BODY: {"badRequest": {"message": "Invalid input for field/attribute
quota_set. Value: {u'cores': 95, u'force': True}. Additional properties
are not allowed (u'force' was unexpected)", "code": 400}}


The problem seems to be that in schemas/quota_sets.py the "force" parameter is 
not in quota_resources, but rather is added to "update_quota_set".  When 
creating update_quota_set_v236 they copied quota_resources instead of copying 
update_quota_set, and this meant that they lost the support for the "force" 
parameter.
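
A simplified illustration of the copy mistake (plain dicts standing in for the
JSON-Schema fragments in schemas/quota_sets.py):

    quota_resources = {'cores': {}, 'instances': {}, 'ram': {}}

    update_quota_set = dict(quota_resources)
    update_quota_set['force'] = {}                   # 'force' is only added here

    # what 2.36 effectively did: copy the wrong base, dropping 'force'
    update_quota_set_v236 = dict(quota_resources)
    print('force' in update_quota_set_v236)          # False -> the 400 error

    # what it presumably should have copied
    update_quota_set_v236_fixed = dict(update_quota_set)
    print('force' in update_quota_set_v236_fixed)    # True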

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: api ocata-backport-potential pike-backport-potential quotas

** Tags added: ocata-backport-potential pike-backport-potential quotas

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1734394

Title:
  nova microversion 2.36 accidentally removed support for "force" when
  setting quotas

Status in OpenStack Compute (nova):
  New

Bug description:
  It is supposed to be possible to specify the "force" option when
  updating a quota-set.  Up to microversion 2.35 this works as expected.

  However, in 2.36 it no longer works, and nova-api sends back:

  RESP BODY: {"badRequest": {"message": "Invalid input for
  field/attribute quota_set. Value: {u'cores': 95, u'force': True}.
  Additional properties are not allowed (u'force' was unexpected)",
  "code": 400}}

  
  The problem seems to be that in schemas/quota_sets.py the "force" parameter 
is not in quota_resources, but rather is added to "update_quota_set".  When 
creating update_quota_set_v236 they copied quota_resources instead of copying 
update_quota_set, and this meant that they lost the support for the "force" 
parameter.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1734394/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1724686] [NEW] authentication code hangs when there are three or more admin keystone endpoints

2017-10-18 Thread Chris Friesen
Public bug reported:

I'm running stable/pike devstack, and I was playing around with what
happens when there are many endpoints in multiple regions, and I
stumbled over a scenario where the keystone authentication code hangs.

My original endpoint list looked like this:

ubuntu@devstack:/opt/stack/devstack$ openstack endpoint list
+--+---+--+-+-+---+--+
| ID   | Region| Service Name | Service Type
| Enabled | Interface | URL  |
+--+---+--+-+-+---+--+
| 0a9979ebfdbf48ce91ccf4e2dd952c1a | RegionOne | kingbird | synchronization 
| True| internal  | http://127.0.0.1:8118/v1.0   |
| 11d5507afe2a4eddb4f030695699114f | RegionOne | placement| placement   
| True| public| http://128.224.186.226/placement |
| 1e42cf139398405188755b7e00aecb4d | RegionOne | keystone | identity
| True| admin | http://128.224.186.226/identity  |
| 2daf99edecae4afba88bb58233595481 | RegionOne | glance   | image   
| True| public| http://128.224.186.226/image |
| 2ece52e8bbb34d47b9bd5611f5959385 | RegionOne | kingbird | synchronization 
| True| admin | http://127.0.0.1:8118/v1.0   |
| 4835a089666a4b03bd2f499457ade6c2 | RegionOne | kingbird | synchronization 
| True| public| http://127.0.0.1:8118/v1.0   |
| 78e9fbc0a47642268eda3e3576920f37 | RegionOne | nova | compute 
| True| public| http://128.224.186.226/compute/v2.1  |
| 96a1e503dc0e4520a190b01f6a0cf79c | RegionOne | keystone | identity
| True| public| http://128.224.186.226/identity  |
| a1887dbc8c5e4af5b4a6dc5ce224b8ff | RegionOne | cinderv2 | volumev2
| True| public| http://128.224.186.226/volume/v2/$(project_id)s  |
| b7d5938141694a4c87adaed5105ea3ab | RegionOne | cinder   | volume  
| True| public| http://128.224.186.226/volume/v1/$(project_id)s  |
| bb169382cbea4715964e4652acd48070 | RegionOne | nova_legacy  | compute_legacy  
| True| public| http://128.224.186.226/compute/v2/$(project_id)s |
| e01c8d8e08874d61b9411045a99d4860 | RegionOne | neutron  | network 
| True| public| http://128.224.186.226:9696/ |
| f94c96ed474249a29a6c0a1bb2b2e500 | RegionOne | cinderv3 | volumev3
| True| public| http://128.224.186.226/volume/v3/$(project_id)s  |
+--+---+--+-+-+---+--+

I was able to successfully run the following python code:

from keystoneauth1 import loading
from keystoneauth1 import session
from keystoneclient.v3 import client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(username='admin', password='secret',
                                project_name='admin',
                                auth_url='http://128.224.186.226/identity')
sess = session.Session(auth=auth)
keystone = client.Client(session=sess)
keystone.services.list()

I then duplicated all of the endpoints in a new region "region2", and
was able to run the python code.  When I duplicated all the endpoints
again in a new region "region3" (for a total of 39 endpoints) the python
code hung at the final line.

Removing all the "region3" endpoints allowed the python code to work
again.

During all of this the command "openstack endpoint list" worked fine.

Further testing seems to indicate that it is the third "admin" keystone
endpoint that is causing the problem.  I can add multiple "public"
keystone endpoints, but three or more "admin" keystone endpoints cause
the python code to hang.

** Affects: keystone
 Importance: Undecided
 Status: New

** Summary changed:

- authentication code hangs when there are many endpoints
+ authentication code hangs when there are three or more admin keystone 
endpoints

** Description changed:

  I'm running stable/pike devstack, and I was playing around with what
  happens when there are many endpoints in multiple regions, and I
  stumbled over a scenario where the keystone authentication code hangs.
  
  My original endpoint list looked like this:
  
  ubuntu@devstack:/opt/stack/devstack$ openstack endpoint list
  
+--+---+--+-+-+---+--+
  | ID   | Region| Service Name | Service Type  
  | Enabled | Interface | URL  |
  

[Yahoo-eng-team] [Bug 1284719] Re: buggy live migration rollback when using shared storage

2017-08-28 Thread Chris Friesen
** Changed in: nova
   Status: Expired => Incomplete

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1284719

Title:
  buggy live migration rollback when using shared storage

Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  I'm running the current Icehouse code in devstack.  I was looking at
  the code and noticed something suspicious.

  It looks like if we try to migrate a shared-storage instance and fail
  and end up rolling back we could end up with messed-up networking on
  the destination host.

  When setting up a live migration we unconditionally run
  ComputeManager.pre_live_migration() on the destination host to do
  various things including setting up networks on the host.

  If something goes wrong with the live migration in
  ComputeManager._rollback_live_migration() we will only call
  self.compute_rpcapi.rollback_live_migration_at_destination() if we're
  doing block migration or volume-backed migration that isn't shared
  storage.

  However, looking at
  ComputeManager.rollback_live_migration_at_destination(), I also see it
  cleaning up networking as well as block device.  If we never call that
  cleanup code, then the networking stuff that was done in
  pre_live_migration() won't get rolled back.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1284719/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1712684] [NEW] allocations not immediately removed when instance is deleted

2017-08-23 Thread Chris Friesen
Public bug reported:

Based on code inspection and a discussion with mriedem on IRC, it
appears that when deleting an instance in a pure-Pike cloud the
allocations are not removed until the update_available_resource()
periodic task calls ResourceTracker._update_usage_from_instances(),
which calls _remove_deleted_instances_allocations().

In a mixed Ocata/Pike cloud the allocation will be freed up immediately
when _update_usage_from_instance() calls
self.reportclient.update_instance_allocation().

In the ServerMovingTests functional test we bypass this by forcing the
periodic task to run before checking that the allocations have been
removed.

** Affects: nova
 Importance: High
 Status: Triaged


** Tags: compute pike-rc-potential placement scheduler

** Summary changed:

- allocations not immediately removed when instance deleted
+ allocations not immediately removed when instance is deleted

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1712684

Title:
  allocations not immediately removed when instance is deleted

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Based on code inspection and a discussion with mriedem on IRC, it
  appears that when deleting an instance in a pure-Pike cloud the
  allocations are not removed until the update_available_resource()
  periodic task calls ResourceTracker._update_usage_from_instances(),
  which calls _remove_deleted_instances_allocations().

  In a mixed Ocata/Pike cloud the allocation will be freed up
  immediately when _update_usage_from_instance() calls
  self.reportclient.update_instance_allocation().

  In the ServerMovingTests functional test we bypass this by forcing the
  periodic task to run before checking that the allocations have been
  removed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1712684/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1695991] [NEW] "nova-manage db online_data_migrations" doesn't report matched/migrated properly

2017-06-05 Thread Chris Friesen
Public bug reported:

When running "nova-manage db online_data_migrations", it will report how
many items matched the query and how many of the matching items were
migrated.

However, most of the migration routines are not properly reporting the
"total matched" count when "max_count" is specified.  This makes it
difficult to know whether you have to call it again or not when
specifying "--max-count" explicitly.

Take for example Flavor.migrate_flavors().  This limits the value of
main_db_ids to a max of "count":

main_db_ids = _get_main_db_flavor_ids(ctxt, count)
count_all = len(main_db_ids)

return count_all, count_hit

If someone sees that there were 50 items total and 50 items were
converted, they may think that all the work is done.  It would be better
to call _get_main_db_flavor_ids() with no limit to the number of
matches, and apply the limit to the number of conversions.

Alternately, we should document that if --max-count is used then "nova-
manage db online_data_migrations" should be called multiple times until
*no* matches are reported and we can basically ignore the number of
hits.  (Or until no hits are reported, which would more closely align
with the code in the case that max-count isn't specified explicitly.)
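
A self-contained sketch of the first option (the two helpers are stand-ins for
the real DB calls, not nova code):

    _PENDING = list(range(7))         # pretend 7 flavors still need migrating

    def _get_main_db_flavor_ids(ctxt, limit=None):
        return _PENDING if limit is None else _PENDING[:limit]

    def _migrate_one_flavor(ctxt, flavor_id):
        pass                          # stand-in for the actual conversion

    def migrate_flavors(ctxt, count):
        main_db_ids = _get_main_db_flavor_ids(ctxt)   # count everything left
        count_hit = 0
        for flavor_id in main_db_ids[:count]:         # convert at most 'count'
            _migrate_one_flavor(ctxt, flavor_id)
            count_hit += 1
        return len(main_db_ids), count_hit

    print(migrate_flavors(None, 5))   # (7, 5): the caller can see work remains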

** Affects: nova
 Importance: Undecided
 Status: New

** Description changed:

  When running "nova-manage db online_data_migrations", it will report how
  many items matched the query and how many of the matching items were
  migrated.
  
  However, most of the migration routines are not properly reporting the
  "total matched" count when "max_count" is specified.  This makes it
  difficult to know whether you have to call it again or not when
  specifying "--max-count" explicitly.
  
  Take for example Flavor.migrate_flavors().  This limits the value of
  main_db_ids to a max of "count":
  
- main_db_ids = _get_main_db_flavor_ids(ctxt, count)
+ main_db_ids = _get_main_db_flavor_ids(ctxt, count)
  count_all = len(main_db_ids)
  
  return count_all, count_hit
  
- 
- If someone sees that there were 50 items total and 50 items were converted, 
they may think that all the work is done.  It would be better to call 
_get_main_db_flavor_ids() with no limit to the number of matches, and apply the 
limit to the number of conversions.
+ If someone sees that there were 50 items total and 50 items were
+ converted, they may think that all the work is done.  It would be better
+ to call _get_main_db_flavor_ids() with no limit to the number of
+ matches, and apply the limit to the number of conversions.
  
  Alternately, we should document that if --max-count is used then "nova-
  manage db online_data_migrations" should be called multiple times until
  *no* matches are reported and we can basically ignore the number of
- hits.
+ hits.  (Or until no hits are reported, which would more closely align
+ with the code in the case that max-count isn't specified explicitly.)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1695991

Title:
  "nova-manage db online_data_migrations" doesn't report
  matched/migrated properly

Status in OpenStack Compute (nova):
  New

Bug description:
  When running "nova-manage db online_data_migrations", it will report
  how many items matched the query and how many of the matching items
  were migrated.

  However, most of the migration routines are not properly reporting the
  "total matched" count when "max_count" is specified.  This makes it
  difficult to know whether you have to call it again or not when
  specifying "--max-count" explicitly.

  Take for example Flavor.migrate_flavors().  This limits the value of
  main_db_ids to a max of "count":

  main_db_ids = _get_main_db_flavor_ids(ctxt, count)
  count_all = len(main_db_ids)
  
  return count_all, count_hit

  If someone sees that there were 50 items total and 50 items were
  converted, they may think that all the work is done.  It would be
  better to call _get_main_db_flavor_ids() with no limit to the number
  of matches, and apply the limit to the number of conversions.

  Alternately, we should document that if --max-count is used then
  "nova-manage db online_data_migrations" should be called multiple
  times until *no* matches are reported and we can basically ignore the
  number of hits.  (Or until no hits are reported, which would more
  closely align with the code in the case that max-count isn't specified
  explicitly.)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1695991/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1695965] [NEW] "nova-manage db online_data_migrations" exit code is strange

2017-06-05 Thread Chris Friesen
Public bug reported:

If I'm reading the code right, the exit value for "nova-manage db
online_data_migrations" will be 1 if we actually performed some
migrations and 0 if we performed no migrations, either because there
were no remaining migrations or because the migration code raised an
exception.

This seems less than useful for someone attempting to script repeated
calls to this with --max-count set.  The caller needs to parse the
output to determine whether or not it was successful.

I think it would make more sense to have the exit code as follows:

0 -- no errors and completed
1 -- one of the migrations raised an exception, needs manual action
3 -- no errors but not yet complete, need to call again

since it would allow for an automated retry based solely on the exit
code.

At the very least, the exit code should be nonzero for the case where
one of the migrations raised an exception, and 0 for the case where no
exception was raised.
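
For illustration, this is the kind of retry loop the proposed exit codes would
allow (hypothetical semantics, not current behaviour):

    import subprocess
    import sys

    while True:
        rc = subprocess.call(['nova-manage', 'db', 'online_data_migrations',
                              '--max-count', '50'])
        if rc == 0:
            break          # everything migrated
        if rc == 3:
            continue       # no errors, but more rows remain; call again
        sys.exit('migration raised an exception (rc=%d); manual action needed'
                 % rc)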

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1695965

Title:
  "nova-manage db online_data_migrations" exit code is strange

Status in OpenStack Compute (nova):
  New

Bug description:
  If I'm reading the code right, the exit value for "nova-manage db
  online_data_migrations" will be 1 if we actually performed some
  migrations and 0 if we performed no migrations, either because there
  were no remaining migrations or because the migration code raised an
  exception.

  This seems less than useful for someone attempting to script repeated
  calls to this with --max-count set.  The caller needs to parse the
  output to determine whether or not it was successful.

  I think it would make more sense to have the exit code as follows:

  0 -- no errors and completed
  1 -- one of the migrations raised an exception, needs manual action
  3 -- no errors but not yet complete, need to call again

  since it would allow for an automated retry based solely on the exit
  code.

  At the very least, the exit code should be nonzero for the case where
  one of the migrations raised an exception, and 0 for the case where no
  exception was raised.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1695965/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1691780] [NEW] port id is incorrectly logged in _update_port_binding_for_instance

2017-05-18 Thread Chris Friesen
Public bug reported:

At line 2484 of
https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py
the code is accessing p['id'] in the LOG.info block, but that means
it logs the last entry that it iterated over in the previous loop over
the ports rather than the port_id being processed in the current loop.

We see this when we have multiple ports; the log suggests the same port is
being updated over and over, when it's actually working properly.

2017-05-10 16:39:32.936 72563 INFO nova.network.neutronv2.api [req-
56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}

2017-05-10 16:39:33.905 72563 INFO nova.network.neutronv2.api [req-
56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}

2017-05-10 16:39:35.084 72563 INFO nova.network.neutronv2.api [req-
56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
-a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
with attributes {'binding:profile': {}, 'binding:host_id': 'compute-6'}
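
A minimal reproduction of the stale-loop-variable pattern (simplified, not the
actual neutronv2 code):

    ports = [{'id': 'port-a'}, {'id': 'port-b'}, {'id': 'port-c'}]

    for p in ports:
        pass                                  # first loop; 'p' keeps its last value

    for port_id in ['port-a', 'port-b', 'port-c']:
        print('Updating port %s' % p['id'])   # buggy: always logs 'port-c'
        print('Updating port %s' % port_id)   # fixed: logs the port being updated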

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress


** Tags: neutron

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1691780

Title:
  port id is incorrectly logged in _update_port_binding_for_instance

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  At line 2484 of
  https://github.com/openstack/nova/blob/master/nova/network/neutronv2/api.py
  the code is accessing p['id'] in the LOG.info block, but that means
  it logs the last entry that it iterated over in the previous loop over
  the ports rather than the port_id being processed in the current loop.

  We see this when we have multiple ports; the log suggests the same port
  is being updated over and over, when it's actually working properly.

  2017-05-10 16:39:32.936 72563 INFO nova.network.neutronv2.api [req-
  56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
  7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
  -a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
  with attributes {'binding:profile': {}, 'binding:host_id':
  'compute-6'}

  2017-05-10 16:39:33.905 72563 INFO nova.network.neutronv2.api [req-
  56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
  7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
  -a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
  with attributes {'binding:profile': {}, 'binding:host_id':
  'compute-6'}

  2017-05-10 16:39:35.084 72563 INFO nova.network.neutronv2.api [req-
  56a25602-5598-48c5-977f-7d76582c2832 a5f3dc1ec00e4ee4a9c7d53163d3508e
  7c2d9914234f4a0ab5e72d802e0f9782 - - -] [instance: d36a69e3-77ba-4f27
  -a15d-be24eee0ae81] Updating port ffe01a49-a569-435d-96f3-18da8a6ca27b
  with attributes {'binding:profile': {}, 'binding:host_id':
  'compute-6'}

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1691780/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1690890] [NEW] error message not clear for shared live migration with block storage

2017-05-15 Thread Chris Friesen
Public bug reported:

When using an older microversion (2.25 or earlier) with boot-from-image,
and the user forgets to specify block-migration, the error message
returned is this:

"Live migration can not be used without shared storage except a booted
from volume VM which does not have a local disk."

This has a couple things wrong with it. First, the triple-negative is a
bit confusing, especially for non-native-english speakers.  Second, it
implies that you cannot do a block migration, which is obviously false.

I think a more clear error message would be something like:

"Shared storage migration requires either shared storage or boot-from-
volume with no local disks."

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1690890

Title:
  error message not clear for shared live migration with block storage

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  When using an older microversion (2.25 or earlier) with boot-from-
  image, and the user forgets to specify block-migration, the error
  message returned is this:

  "Live migration can not be used without shared storage except a booted
  from volume VM which does not have a local disk."

  This has a couple things wrong with it. First, the triple-negative is
  a bit confusing, especially for non-native-english speakers.  Second,
  it implies that you cannot do a block migration, which is obviously
  false.

  I think a more clear error message would be something like:

  "Shared storage migration requires either shared storage or boot-from-
  volume with no local disks."

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1690890/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1688673] [NEW] cpu_realtime_mask handling is not intuitive

2017-05-05 Thread Chris Friesen
Public bug reported:

The nova code implicitly assumes that all vCPUs are realtime in
nova.virt.hardware.vcpus_realtime_topology(), and then it appends the
user-specified mask.

This only makes sense if the user-specified cpu_realtime_mask is an
exclusion mask, but this isn't documented anywhere.

It would make more sense to simply use the mask as passed-in from the
end-user.

In order to preserve the current behaviour we should probably special-
case the scenario where the passed-in cpu_realtime_mask starts with a
"^" (indicating an exclusion).

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

** Description changed:

  The nova code implicitly assumes that all vCPUs are realtime in
- nova.virt.hardware.vcpus_realtime_topology().
+ nova.virt.hardware.vcpus_realtime_topology(), and then it appends the
+ user-specified mask.
  
- This only makes sense if the cpu_realtime_mask is an exclusion mask, but
- this isn't documented anywhere.
+ This only makes sense if the user-specified cpu_realtime_mask is an
+ exclusion mask, but this isn't documented anywhere.
  
  It would make more sense to simply use the mask as passed-in from the
  end-user.
  
  In order to preserve the current behaviour we should probably special-
  case the scenario where the passed-in cpu_realtime_mask starts with a
  "^" (indicating an exclusion).

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1688673

Title:
  cpu_realtime_mask handling is not intuitive

Status in OpenStack Compute (nova):
  New

Bug description:
  The nova code implicitly assumes that all vCPUs are realtime in
  nova.virt.hardware.vcpus_realtime_topology(), and then it appends the
  user-specified mask.

  This only makes sense if the user-specified cpu_realtime_mask is an
  exclusion mask, but this isn't documented anywhere.

  It would make more sense to simply use the mask as passed-in from the
  end-user.

  In order to preserve the current behaviour we should probably special-
  case the scenario where the passed-in cpu_realtime_mask starts with a
  "^" (indicating an exclusion).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1688673/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1688599] [NEW] resource audit races against evacuating instance

2017-05-05 Thread Chris Friesen
Public bug reported:

We recently hit an issue where an evacuating instance with the dedicated
cpu_policy was pinned to the same host CPUs as other instances with the
dedicated cpu_policy. During subsequent resource audits we would see cpu
pinning errors.

The root cause appears to be the fact that the resource audit skips the
evacuating instance during migration phase of audit while instance was
rebuilding on new host.  It appears that _instance_in_resize_state()
returned "false" because the vm_state was vm_states.ERROR.  We allow
rebuilding from the ERROR state though, so we should consider it.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1688599

Title:
  resource audit races against evacuating instance

Status in OpenStack Compute (nova):
  New

Bug description:
  We recently hit an issue where an evacuating instance with the dedicated
  cpu_policy was pinned to the same host CPUs as other instances with the
  dedicated cpu_policy. During subsequent resource audits we would see
  cpu pinning errors.

  The root cause appears to be the fact that the resource audit skips
  the evacuating instance during migration phase of audit while instance
  was rebuilding on new host.  It appears that
  _instance_in_resize_state() returned "false" because the vm_state was
  vm_states.ERROR.  We allow rebuilding from the ERROR state though, so
  we should consider it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1688599/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1687067] [NEW] problems with cpu and cpu-thread policy where flavor/image specify different settings

2017-04-28 Thread Chris Friesen
Public bug reported:

There are a number of issues related to CPU policy and CPU thread policy
where the flavor extra-spec and image properties do not match up.

The docs at https://docs.openstack.org/admin-guide/compute-cpu-
topologies.html say the following:

"Image metadata takes precedence over flavor extra specs. Thus,
configuring competing policies causes an exception. By setting a shared
policy through image metadata, administrators can prevent users
configuring CPU policies in flavors and impacting resource utilization."

For the CPU policy this is exactly backwards based on the code.  The
flavor is specified by the admin, and so it generally takes priority
over the image which is specified by the end user.  If the flavor
specifies "dedicated" then the result is dedicated regardless of what
the image specifies.  If the flavor specifies "shared" then the result
depends on the image--if it specifies "dedicated" then we will raise an
exception, otherwise we use "shared".  If the flavor doesn't specify a
CPU policy then the image can specify whatever policy it wants.

The issue around CPU threading policy is more complicated.

Back in Mitaka, if the flavor specified a CPU threading policy of either
None or "prefer" then we would use the threading policy specified by the
image (if it was set).  If the flavor specified a CPU threading policy
of "isolate" or "require" and the image specified a different CPU
threading policy then we raised
exception.ImageCPUThreadPolicyForbidden(), otherwise we used the CPU
threading policy specified by the flavor.  This behaviour is described
in the spec at https://specs.openstack.org/openstack/nova-
specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html

In git commit 24997343 (which went into Newton) Nikola Dipanov made a
code change that doesn't match the intent in the git commit message:

 if flavor_thread_policy in [None, fields.CPUThreadAllocationPolicy.PREFER]:
-cpu_thread_policy = image_thread_policy
+cpu_thread_policy = flavor_thread_policy or image_thread_policy

The effect of this is that if the flavor specifies a CPU threading
policy of "prefer" then we will use a policy of "prefer" regardless of
the policy from the image.  If the flavor specifies a CPU threading
policy of None then we will use the policy from the image.

This is a bug, because the original intent was to treat None and
"prefer" identically, since "prefer" was just an explicit way to specify
the default behaviour.
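
A simplified comparison of the intended Mitaka behaviour and what the Newton
change actually does (plain functions, not the real nova objects):

    PREFER = 'prefer'

    def mitaka_intent(flavor_policy, image_policy):
        if flavor_policy in (None, PREFER):
            return image_policy                    # None and 'prefer' act the same
        return flavor_policy

    def newton_actual(flavor_policy, image_policy):
        if flavor_policy in (None, PREFER):
            return flavor_policy or image_policy   # 'prefer' now wins over the image
        return flavor_policy

    print(mitaka_intent(PREFER, 'isolate'))   # isolate
    print(newton_actual(PREFER, 'isolate'))   # prefer  <-- the unintended change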

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1687067

Title:
  problems with cpu and cpu-thread policy where flavor/image specify
  different settings

Status in OpenStack Compute (nova):
  New

Bug description:
  There are a number of issues related to CPU policy and CPU thread
  policy where the flavor extra-spec and image properties do not match
  up.

  The docs at https://docs.openstack.org/admin-guide/compute-cpu-
  topologies.html say the following:

  "Image metadata takes precedence over flavor extra specs. Thus,
  configuring competing policies causes an exception. By setting a
  shared policy through image metadata, administrators can prevent users
  configuring CPU policies in flavors and impacting resource
  utilization."

  For the CPU policy this is exactly backwards based on the code.  The
  flavor is specified by the admin, and so it generally takes priority
  over the image which is specified by the end user.  If the flavor
  specifies "dedicated" then the result is dedicated regardless of what
  the image specifies.  If the flavor specifies "shared" then the result
  depends on the image--if it specifies "dedicated" then we will raise
  an exception, otherwise we use "shared".  If the flavor doesn't
  specify a CPU policy then the image can specify whatever policy it
  wants.

  The issue around CPU threading policy is more complicated.

  Back in Mitaka, if the flavor specified a CPU threading policy of
  either None or "prefer" then we would use the threading policy
  specified by the image (if it was set).  If the flavor specified a CPU
  threading policy of "isolate" or "require" and the image specified a
  different CPU threading policy then we raised
  exception.ImageCPUThreadPolicyForbidden(), otherwise we used the CPU
  threading policy specified by the flavor.  This behaviour is described
  in the spec at https://specs.openstack.org/openstack/nova-
  specs/specs/mitaka/implemented/virt-driver-cpu-thread-pinning.html

  In git commit 24997343 (which went into Newton) Nikola Dipanov made a
  code change that doesn't match the intent in the git commit message:

   if flavor_thread_policy in [None, 
fields.CPUThreadAllocationPolicy.PREFER]:
  -cpu_thread_policy = image_thread_policy
  +

[Yahoo-eng-team] [Bug 1669054] [NEW] RequestSpec.ignore_hosts from resize is reused in subsequent evacuate

2017-03-01 Thread Chris Friesen
Public bug reported:

When doing a resize, if CONF.allow_resize_to_same_host is False, then we
set RequestSpec.ignore_hosts and then save the RequestSpec.

When we go to use the same RequestSpec on a subsequent rebuild/evacuate,
ignore_hosts is still set from the previous resize.

In RequestSpec.reset_forced_destinations() we reset force_hosts and
force_nodes; it might make sense to also reset ignore_hosts.
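
A sketch of the suggested tweak (attributes simplified; not the actual
RequestSpec object code):

    class RequestSpec(object):
        def __init__(self):
            self.force_hosts = ['compute-1']
            self.force_nodes = None
            self.ignore_hosts = ['compute-0']   # left over from an earlier resize

        def reset_forced_destinations(self):
            self.force_hosts = None
            self.force_nodes = None
            self.ignore_hosts = None            # proposed addition

    spec = RequestSpec()
    spec.reset_forced_destinations()
    print(spec.ignore_hosts)                    # None: evacuate is unconstrained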

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1669054

Title:
  RequestSpec.ignore_hosts from resize is reused in subsequent evacuate

Status in OpenStack Compute (nova):
  New

Bug description:
  When doing a resize, if CONF.allow_resize_to_same_host is False, then
  we set RequestSpec.ignore_hosts and then save the RequestSpec.

  When we go to use the same RequestSpec on a subsequent
  rebuild/evacuate, ignore_hosts is still set from the previous resize.

  In RequestSpec.reset_forced_destinations() we reset force_hosts and
  force_nodes; it might make sense to also reset ignore_hosts.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1669054/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1573288] Re: over time, horizon's admin -> overview page becomes very slow ....

2017-01-25 Thread Chris Friesen
*** This bug is a duplicate of bug 1508571 ***
https://bugs.launchpad.net/bugs/1508571

** This bug has been marked a duplicate of bug 1508571
   Overview panels use too wide date range as default

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1573288

Title:
   over time, horizon's admin -> overview page becomes very slow 

Status in OpenStack Dashboard (Horizon):
  Incomplete

Bug description:
  I've noticed that when logging into the admin account after a bunch of
  activity against the RDO installation, it takes a very long time (many
  minutes) before horizon loads (I think the issue is the overview admin
  page which is also the main landing page for logging in).

  The list includes overall activity including deleted projects.  If you
  orchestrate lots of testing against the installation using "rally" you
  will see lots of projects get created and later deleted.  As such I
  have an overview page which lists at the bottom:

  "Displaying 2035 items"

  Is it possible to do something about the Overview page either by
  displaying only the first 20 items, or changing the type of
  information being displayed?  Logging into admin is very painful
  currently.  Non-admin accounts login quickly.

  
  Version-Release number of selected component (if applicable):

  Liberty

  How reproducible:

  Always.

  Steps to Reproduce:

  Run rally against openstack in an endless loop.  After a few days (or
  hours depending on what you do and how you do it) you will find
  horizon getting slower and slower.

  Originally reported against RDO here:
  https://bugzilla.redhat.com/show_bug.cgi?id=1329414

  though this is likely a general issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1573288/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1654345] Re: realtime emulatorpin should use pcpus, not vcpus

2017-01-05 Thread Chris Friesen
Looks like this has already been dealt with on Master via bug 1614054,
commit 6683bf9.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1654345

Title:
  realtime emulatorpin should use pcpus, not vcpus

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  When specifying "hw:cpu_realtime_mask" in the flavor,
  LibvirtDriver._get_guest_numa_config() calls
  hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and
  "vcpus_em".  It then directly uses "vcpus_em" to set the "emulatorpin"
  cpuset.

  The problem is that libvirt expects the "emulatorpin" cpuset to be
  specified as physical CPUs, not virtual CPUs.

  This results in unexpected values being used for the emulator pinning.

  The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the
  pCPUs to the "emulatorpin" cpuset.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1654345/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1654345] [NEW] realtime emulatorpin should use pcpus, not vcpus

2017-01-05 Thread Chris Friesen
Public bug reported:

When specifying "hw:cpu_realtime_mask" in the flavor,
LibvirtDriver._get_guest_numa_config() calls
hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and
"vcpus_em".  It then directly uses "vcpus_em" to set the "emulatorpin"
cpuset.

The problem is that libvirt expects the "emulatorpin" cpuset to be
specified as physical CPUs, not virtual CPUs.

This results in unexpected values being used for the emulator pinning.

The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the
pCPUs to the "emulatorpin" cpuset.

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New


** Tags: compute libvirt newton-backport-potential

** Description changed:

  When specifying "hw:cpu_realtime_mask" in the flavor,
  LibvirtDriver._get_guest_numa_config() calls
  hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and
  "vcpus_em".  It then directly uses "vcpus_em" to set the "emulatorpin"
  cpuset.
  
  The problem is that libvirt expects the "emulatorpin" cpuset to be
  specified as physical CPUs, not virtual CPUs.
  
+ This results in unexpected values being used for the emulator pinning.
+ 
  The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the
  pCPUs to the "emulatorpin" cpuset.

** Changed in: nova
 Assignee: (unassigned) => Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1654345

Title:
  realtime emulatorpin should use pcpus, not vcpus

Status in OpenStack Compute (nova):
  New

Bug description:
  When specifying "hw:cpu_realtime_mask" in the flavor,
  LibvirtDriver._get_guest_numa_config() calls
  hardware.vcpus_realtime_topology() to calculate "vcpus_rt" and
  "vcpus_em".  It then directly uses "vcpus_em" to set the "emulatorpin"
  cpuset.

  The problem is that libvirt expects the "emulatorpin" cpuset to be
  specified as physical CPUs, not virtual CPUs.

  This results in unexpected values being used for the emulator pinning.

  The fix is to convert "vcpus_em" from vCPUs to pCPUs, and assign the
  pCPUs to the "emulatorpin" cpuset.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1654345/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1638961] [NEW] evacuating an instance loses files specified via "--file" on the cli

2016-11-03 Thread Chris Friesen
Public bug reported:

I booted up an instance as follows in my stable/mitaka devstack
environment:

$ echo "this is a test" > /tmp/my_user_data.txt
$ echo "blah1" > /tmp/file1
$ echo "blah2" > /tmp/file2
$ nova boot --flavor m1.tiny --image cirros-0.3.4-x86_64-uec  --config-drive 
true --user-data /tmp/my_user_data.txt --file /root/file1=/tmp/file1 --file 
/tmp/file2=/tmp/file2 testing


This booted up an instance, and within the guest I ran the following:

$ mkdir mnt
$ mount /dev/sr0 mnt
$ cat mnt/openstack/latest/user_data
this is a test
$ umount mnt
$ cat /root/file1
blah1
$ cat /tmp/file2
blah2

Then I killed the compute node and ran "nova evacuate testing".

The evacuated instance had a config drive at /dev/sr0, but it did not
have the /root/file1 or /tmp/file2 files.  This is arguably incorrect.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1638961

Title:
  evacuating an instance loses files specified via "--file" on the cli

Status in OpenStack Compute (nova):
  New

Bug description:
  I booted up an instance as follows in my stable/mitaka devstack
  environment:

  $ echo "this is a test" > /tmp/my_user_data.txt
  $ echo "blah1" > /tmp/file1
  $ echo "blah2" > /tmp/file2
  $ nova boot --flavor m1.tiny --image cirros-0.3.4-x86_64-uec  --config-drive 
true --user-data /tmp/my_user_data.txt --file /root/file1=/tmp/file1 --file 
/tmp/file2=/tmp/file2 testing

  
  This booted up an instance, and within the guest I ran the following:

  $ mkdir mnt
  $ mount /dev/sr0 mnt
  $ cat mnt/openstack/latest/user_data
  this is a test
  $ umount mnt
  $ cat /root/file1
  blah1
  $ cat /tmp/file2
  blah2

  Then I killed the compute node and ran "nova evacuate testing".

  The evacuated instance had a config drive at /dev/sr0, but it did not
  have the /root/file1 or /tmp/file2 files.  This is arguably incorrect.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1638961/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1613488] Re: changed fields of versionedobjects not tracked properly when down-versioning object

2016-08-29 Thread Chris Friesen
The review for the oslo.versionedobjects change is here:
https://review.openstack.org/#/c/355981/

** Changed in: nova
   Status: New => Fix Released

** Project changed: nova => oslo.versionedobjects

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1613488

Title:
  changed fields of versionedobjects not tracked properly when down-
  versioning object

Status in oslo.versionedobjects:
  Fix Released

Bug description:
  Sorry for the complicated write-up below, but the issue is
  complicated.

  
  I'm running into a problem between Mitaka and Kilo, but I *think* it'll also 
hit Mitaka/Liberty.  The problem scenario is when we have older and newer 
services talking to each other.  The problem occurs when nova-conductor writes 
to an object field that is removed in obj_make_compatible().  In particular, 
I'm hitting this with 'parent_addr' in the PciDevice class since it gets 
written in PciDevice._from_db_object().

  In oslo_versionedobjects/base.py the remotable() function has the following 
line:
self._changed_fields = set(updates.get('obj_what_changed', []))

  This blindly sets the local self._changed_fields to be whatever the
  remote end sent as updates['obj_what_changed'].

  This is a problem because the far end can include fields that don't
  actually exist in the older object version.  On the far end (which may
  be newer) in nova.conductor.manager.ConductorManager.object_action(),
  we will call the following (where 'objinst' is the current version of
  the object):

  updates['obj_what_changed'] = objinst.obj_what_changed()

  Since this is called against the newer object code, it can specify
  fields that do not exist in the older version of the object if nova-
  conductor has written those fields.

  The only workaround I've been able to come up with for this is to
  modify oslo_versionedobjects.base.remotable() to only include a field
  in self._changed_fields if it's in self.fields.  This requires
  updating the older code prior to an upgrade, however.

  
  I think there's another related issue.  In VersionedObject.obj_to_primitive() 
we set the changes in the primitive like this:

  if self.obj_what_changed():
  obj[self._obj_primitive_key('changes')] = list(
  self.obj_what_changed())

  Since we call self.obj_what_changed() on the newer version of the
  object, I think we will include changes to fields that were removed by
  obj_make_compatible_from_manifest().

  It seems to me that in obj_to_primitive() we should not allow fields
  to be included in obj[self._obj_primitive_key('changes')] unless
  they're also listed in obj[self._obj_primitive_key('data')].

To manage notifications about this bug go to:
https://bugs.launchpad.net/oslo.versionedobjects/+bug/1613488/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1613488] [NEW] changed fields of versionedobjects not tracked properly when down-versioning object

2016-08-15 Thread Chris Friesen
Public bug reported:

Sorry for the complicated write-up below, but the issue is complicated.


I'm running into a problem between Mitaka and Kilo, but I *think* it'll also 
hit Mitaka/Liberty.  The problem scenario is when we have older and newer 
services talking to each other.  The problem occurs when nova-conductor writes 
to an object field that is removed in obj_make_compatible().  In particular, 
I'm hitting this with 'parent_addr' in the PciDevice class since it gets 
written in PciDevice._from_db_object().

In oslo_versionedobjects/base.py the remotable() function has the following 
line:
self._changed_fields = set(updates.get('obj_what_changed', []))

This blindly sets the local self._changed_fields to be whatever the
remote end sent as updates['obj_what_changed'].

This is a problem because the far end can include fields that don't
actually exist in the older object version.  On the far end (which may
be newer) in nova.conductor.manager.ConductorManager.object_action(), we
will call the following (where 'objinst' is the current version of the
object):

updates['obj_what_changed'] = objinst.obj_what_changed()

Since this is called against the newer object code, it can specify
fields that do not exist in the older version of the object if nova-
conductor has written those fields.

The only workaround I've been able to come up with for this is to modify
oslo_versionedobjects.base.remotable() to only include a field in
self._changed_fields if it's in self.fields.  This requires updating the
older code prior to an upgrade, however.


I think there's another related issue.  In VersionedObject.obj_to_primitive() 
we set the changes in the primitive like this:

if self.obj_what_changed():
obj[self._obj_primitive_key('changes')] = list(
self.obj_what_changed())

Since we call self.obj_what_changed() on the newer version of the
object, I think we will include changes to fields that were removed by
obj_make_compatible_from_manifest().

It seems to me that in obj_to_primitive() we should not allow fields to
be included in obj[self._obj_primitive_key('changes')] unless they're
also listed in obj[self._obj_primitive_key('data')].
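
A minimal sketch of the two guards proposed above (the helper names are
illustrative and this is not the actual oslo.versionedobjects patch):

def filter_changed_fields(reported_changes, local_fields):
    # remotable(): only accept changed fields that the local (older)
    # object version actually defines
    return set(f for f in reported_changes if f in local_fields)

def filter_primitive_changes(changes, data):
    # obj_to_primitive(): only advertise changes for fields that survived
    # obj_make_compatible() and are therefore present in 'data'
    return [f for f in changes if f in data]

print(filter_changed_fields({'parent_addr', 'address'}, {'address'}))
print(filter_primitive_changes(['parent_addr', 'address'],
                               {'address': '0000:04:00.1'}))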

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute oslo

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1613488

Title:
  changed fields of versionedobjects not tracked properly when down-
  versioning object

Status in OpenStack Compute (nova):
  New

Bug description:
  Sorry for the complicated write-up below, but the issue is
  complicated.

  
  I'm running into a problem between Mitaka and Kilo, but I *think* it'll also 
hit Mitaka/Liberty.  The problem scenario is when we have older and newer 
services talking to each other.  The problem occurs when nova-conductor writes 
to an object field that is removed in obj_make_compatible().  In particular, 
I'm hitting this with 'parent_addr' in the PciDevice class since it gets 
written in PciDevice._from_db_object().

  In oslo_versionedobjects/base.py the remotable() function has the following 
line:
self._changed_fields = set(updates.get('obj_what_changed', []))

  This blindly sets the local self._changed_fields to be whatever the
  remote end sent as updates['obj_what_changed'].

  This is a problem because the far end can include fields that don't
  actually exist in the older object version.  On the far end (which may
  be newer) in nova.conductor.manager.ConductorManager.object_action(),
  we will call the following (where 'objinst' is the current version of
  the object):

  updates['obj_what_changed'] = objinst.obj_what_changed()

  Since this is called against the newer object code, it can specify
  fields that do not exist in the older version of the object if nova-
  conductor has written those fields.

  The only workaround I've been able to come up with for this is to
  modify oslo_versionedobjects.base.remotable() to only include a field
  in self._changed_fields if it's in self.fields.  This requires
  updating the older code prior to an upgrade, however.

  
  I think there's another related issue.  In VersionedObject.obj_to_primitive() 
we set the changes in the primitive like this:

  if self.obj_what_changed():
  obj[self._obj_primitive_key('changes')] = list(
  self.obj_what_changed())

  Since we call self.obj_what_changed() on the newer version of the
  object, I think we will include changes to fields that were removed by
  obj_make_compatible_from_manifest().

  It seems to me that in obj_to_primitive() we should not allow fields
  to be included in obj[self._obj_primitive_key('changes')] unless
  they're also listed in obj[self._obj_primitive_key('data')].

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1613488/+subscriptions

[Yahoo-eng-team] [Bug 1605720] [NEW] backing store missing for ephemeral disk on migration with boot-from-vol

2016-07-22 Thread Chris Friesen
Public bug reported:

I'm on stable/mitaka, but the master code looks similar.

I have compute nodes configured to use qcow2 and libvirt.  The flavor
has an ephemeral disk and a swap disk.  I boot an instance with this
flavor, and the instance is boot-from-volume.

When I try to cold-migrate the instance, I get an error:
2016-07-21 23:33:48.561 46340 ERROR nova.compute.manager [instance: 
4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb] libvirtError: Cannot access backing file 
'/etc/nova/instances/_base/ephemeral_1_0706d66' of storage file 
'/etc/nova/instances/4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb/disk.eph0' (as uid:0, 
gid:0): No such file or directory


The problem seems to be that in 
nova.virt.libvirt.driver.LibvirtDriver.finish_migration() we call 
self._create_image(...block_device_info=None...)

Down in _create_image() we handle the case of a "disk.local" ephemeral
device, but that doesn't help because the device is actually named
"disk.eph0".   It looks like we then try to loop over any ephemerals in
block_device_info, but that's None so we don't handle any of those
(which is too bad since it looks like they would be named correctly).

The end result is that we have a qcow2 "disk.eph0" image, but with
potentially no backing store in /_base.  When we tell
libvirt to start the instance, this results in the above error.
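
A self-contained toy showing why passing block_device_info=None skips the
ephemeral handling described above (this is not the nova _create_image()
code, just the shape of the problem):

def ephemerals_to_create(block_device_info):
    if not block_device_info:
        return []
    return block_device_info.get('ephemerals', [])

print(ephemerals_to_create(None))                            # [] -> nothing recreated
print(ephemerals_to_create({'ephemerals': ['disk.eph0']}))   # ['disk.eph0']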

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1605720

Title:
  backing store missing for ephemeral disk on migration with boot-from-
  vol

Status in OpenStack Compute (nova):
  New

Bug description:
  I'm on stable/mitaka, but the master code looks similar.

  I have compute nodes configured to use qcow2 and libvirt.  The flavor
  has an ephemeral disk and a swap disk.  I boot an instance with this
  flavor, and the instance is boot-from-volume.

  When I try to cold-migrate the instance, I get an error:
  2016-07-21 23:33:48.561 46340 ERROR nova.compute.manager [instance: 
4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb] libvirtError: Cannot access backing file 
'/etc/nova/instances/_base/ephemeral_1_0706d66' of storage file 
'/etc/nova/instances/4e52bfd8-0c71-48dc-89fb-6f6b31dc06bb/disk.eph0' (as uid:0, 
gid:0): No such file or directory

  
  The problem seems to be that in 
nova.virt.libvirt.driver.LibvirtDriver.finish_migration() we call 
self._create_image(...block_device_info=None...)

  Down in _create_image() we handle the case of a "disk.local" ephemeral
  device, but that doesn't help because the device is actually named
  "disk.eph0".   It looks like we then try to loop over any ephemerals
  in block_device_info, but that's None so we don't handle any of those
  (which is too bad since it looks like they would be named correctly).

  The end result is that we have a qcow2 "disk.eph0" image, but with
  potentially no backing store in /_base.  When we tell
  libvirt to start the instance, this results in the above error.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1605720/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1602814] [NEW] hyperthreading bug in NUMATopologyFilter

2016-07-13 Thread Chris Friesen
Public bug reported:

I recently ran into an issue where I was trying to boot an instance with
8 vCPUs, with hw:cpu_policy=dedicated.  The host had 8 pCPUs available,
but they were a mix of siblings and non-siblings.

In virt.hardware._pack_instance_onto_cores(), the _get_pinning()
function seems to be the culprit.  It was called with the following
inputs:

(Pdb) threads_no
1
(Pdb) sibling_set
[CoercedSet([63]), CoercedSet([49]), CoercedSet([48]), CoercedSet([50]), 
CoercedSet([59, 15]), CoercedSet([18, 62])]
(Pdb) instance_cell.cpuset
CoercedSet([0, 1, 2, 3, 4, 5, 6, 7])

As we can see, we are looking for 8 vCPUs, and there are 8 pCPUs
available.  However, when we call _get_pinning() it doesn't give us a
mapping:

> /usr/lib/python2.7/site-packages/nova/virt/hardware.py(899)_pack_instance_onto_cores()
-> pinning = _get_pinning(threads_no, sibling_set,
(Pdb) n
> /usr/lib/python2.7/site-packages/nova/virt/hardware.py(900)_pack_instance_onto_cores()
-> instance_cell.cpuset)
(Pdb) n
> /usr/lib/python2.7/site-packages/nova/virt/hardware.py(901)_pack_instance_onto_cores()
-> if pinning:
(Pdb) pinning


This is a bug: if we haven't specified anything regarding hyperthreading then
we should be able to run with a mix of siblings and non-siblings.
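
One plausible reading of the pdb data, using the length check quoted in the
ISOLATE report (bug 1590091) later in this digest, is that _get_pinning()
bails out before it ever tries to build a mapping.  A self-contained
illustration of the arithmetic:

threads_no = 1
sibling_set = [{63}, {49}, {48}, {50}, {59, 15}, {18, 62}]   # from the pdb session
instance_cores = {0, 1, 2, 3, 4, 5, 6, 7}
# 1 * 6 < 8, so the function returns None even though 8 pCPUs are free
print(threads_no * len(sibling_set) < len(instance_cores))   # True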

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute numa scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1602814

Title:
  hyperthreading bug in NUMATopologyFilter

Status in OpenStack Compute (nova):
  New

Bug description:
  I recently ran into an issue where I was trying to boot an instance
  with 8 vCPUs, with hw:cpu_policy=dedicated.  The host had 8 pCPUs
  available, but they were a mix of siblings and non-siblings.

  In virt.hardware._pack_instance_onto_cores(), the _get_pinning()
  function seems to be the culprit.  It was called with the following
  inputs:

  (Pdb) threads_no
  1
  (Pdb) sibling_set
  [CoercedSet([63]), CoercedSet([49]), CoercedSet([48]), CoercedSet([50]), 
CoercedSet([59, 15]), CoercedSet([18, 62])]
  (Pdb) instance_cell.cpuset
  CoercedSet([0, 1, 2, 3, 4, 5, 6, 7])

  As we can see, we are looking for 8 vCPUs, and there are 8 pCPUs
  available.  However, when we call _get_pinning() it doesn't give us a
  mapping:

  > 
/usr/lib/python2.7/site-packages/nova/virt/hardware.py(899)_pack_instance_onto_cores()
  -> pinning = _get_pinning(threads_no, sibling_set,
  (Pdb) n
  > 
/usr/lib/python2.7/site-packages/nova/virt/hardware.py(900)_pack_instance_onto_cores()
  -> instance_cell.cpuset)
  (Pdb) n
  > 
/usr/lib/python2.7/site-packages/nova/virt/hardware.py(901)_pack_instance_onto_cores()
  -> if pinning:
  (Pdb) pinning

  
  This is a bug: if we haven't specified anything regarding hyperthreading then
we should be able to run with a mix of siblings and non-siblings.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1602814/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1600304] [NEW] _update_usage_from_migrations() can end up processing stale migrations

2016-07-08 Thread Chris Friesen
Public bug reported:

I recently found a bug in Mitaka, and it appears to be still present in
master.

I was testing a separate patch by doing resizes, and bugs in my code had
resulted in a number of incomplete resizes involving compute-1.  I then
did a resize from compute-0 to compute-0, and saw compute-1's resource
usage go up when it ran the resource audit.

This got me curious, so I went digging and discovered a gap in the current 
resource audit logic.  The problem arises if:

1) You have one or more stale migrations which didn't complete
properly that involve the current compute node.

2) The instance from the uncompleted migration is currently doing a
resize/migration that does not involve the current compute node.

When this happens, _update_usage_from_migrations() will be passed in the stale 
migration, and since the instance is in fact in a resize state, the current 
compute node will erroneously account for the instance.  (Even though the 
instance isn't doing anything involving the current compute node.)

The fix is to check that the instance migration ID matches the ID of the 
migration being analyzed.  This will work because in the case of the stale 
migration we will have hit the error case in _pair_instances_to_migrations(), 
and so the instance will be lazy-loaded from the DB, ensuring that its 
migration ID is up-to-date.
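
A rough sketch of the proposed check (field names follow the nova objects,
but this is illustrative rather than the actual patch):

from types import SimpleNamespace

def migration_is_stale(instance, migration):
    # an instance that is mid-resize carries the ID of its *current*
    # migration; anything else paired with it is stale
    current_id = getattr(instance.migration_context, 'migration_id', None)
    return current_id is not None and current_id != migration.id

inst = SimpleNamespace(migration_context=SimpleNamespace(migration_id=42))
mig = SimpleNamespace(id=17)
print(migration_is_stale(inst, mig))   # True -> skip this migration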

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1600304

Title:
  _update_usage_from_migrations() can end up processing stale migrations

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  I recently found a bug in Mitaka, and it appears to be still present
  in master.

  I was testing a separate patch by doing resizes, and bugs in my code
  had resulted in a number of incomplete resizes involving compute-1.  I
  then did a resize from compute-0 to compute-0, and saw compute-1's
  resource usage go up when it ran the resource audit.

  This got me curious, so I went digging and discovered a gap in the current 
resource audit logic.  The problem arises if:
  
  1) You have one or more stale migrations which didn't complete
  properly that involve the current compute node.
  
  2) The instance from the uncompleted migration is currently doing a
  resize/migration that does not involve the current compute node.
  
  When this happens, _update_usage_from_migrations() will be passed in the 
stale migration, and since the instance is in fact in a resize state, the 
current compute node will erroneously account for the instance.  (Even though 
the instance isn't doing anything involving the current compute node.)
  
  The fix is to check that the instance migration ID matches the ID of the 
migration being analyzed.  This will work because in the case of the stale 
migration we will have hit the error case in _pair_instances_to_migrations(), 
and so the instance will be lazy-loaded from the DB, ensuring that its 
migration ID is up-to-date.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1600304/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1213224] Re: nova allows multiple aggregates with same zone name

2016-06-30 Thread Chris Friesen
Just to clarify something, availability zones don't "have" host
aggregates.  Rather, some host aggregates *are also* availability zones,
but a given host can only be in one availability zone.

I went and looked at the code, and the way it is currently written I
think it is actually okay to have multiple host aggregates specifying
the same availability zone.

The logic in the AvailabilityZoneFilter is basically to loop over all
host aggregates for the host in question, and if one of them has an
availability zone (there should be only one) then the filter will check
it against the availability zone specified by the user.

As such, I'm going to close this bug.
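
For reference, a paraphrased sketch of the filter behaviour described above
(dict-based aggregates and the 'nova' default are illustrative; this is not
the actual AvailabilityZoneFilter code):

def host_passes(host_aggregates, requested_az, default_az='nova'):
    azs = {agg['availability_zone']
           for agg in host_aggregates
           if 'availability_zone' in agg}
    # a host should only ever carry one AZ; fall back to the default
    return requested_az in (azs or {default_az})

aggs = [{'name': 'rack1'}, {'name': 'zone-group', 'availability_zone': 'zone-a'}]
print(host_passes(aggs, 'zone-a'))   # True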

** Changed in: nova
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1213224

Title:
  nova allows multiple aggregates with same zone name

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Currently (on grizzly), nova will let you specify multiple aggregates
  with the same zone name.

  This seems like a mismatch since the end-user can only specify an
  availability zone when creating an instance, and there could be
  multiple aggregates (with different hosts) mapping to that zone.

  On aggregate creation, nova should ensure that the availability zone
  name (if specified) is not a duplicate of an existing availability
  zone name.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1213224/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1590607] [NEW] incorrect handling of host numa cell usage with instances having no numa topology

2016-06-08 Thread Chris Friesen
Public bug reported:

I think there is a problem in host NUMA node resource tracking when
there is an instance with no numa topology on the same node as instances
with numa topology.

It's triggered while running the resource audit, which ultimately calls
hardware.get_host_numa_usage_from_instance() and assigns the result to
self.compute_node.numa_topology.

The problem occurs if you have a number of instances with numa topology,
and then an instance with no numa topology. When running
numa_usage_from_instances() for the instance with no numa topology we
cache the values of "memory_usage" and "cpu_usage". However, because
instance.cells is empty we don't enter the loop. Since the two lines in
this commit are indented too far they don't get called, and we end up
appending a host cell with "cpu_usage" and "memory_usage" of zero.
This results in a host numa_topology cell with incorrect "cpu_usage" and
"memory_usage" values, though I think the overall host cpu/memory usage
is still correct.

The fix is to reduce the indentation of the two lines in question so
that they get called even when the instance has no numa topology. This
writes the original host cell usage information back to it.
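
A self-contained toy showing the effect of the over-indentation described
above (this is not the nova numa_usage_from_instances() code, only the
shape of the bug):

def new_cell_usage(instance_cells, host_cpu_usage, host_mem_usage):
    cpu_usage, mem_usage = host_cpu_usage, host_mem_usage
    result = (0, 0)
    for cpus, mem in instance_cells:
        cpu_usage += cpus
        mem_usage += mem
        result = (cpu_usage, mem_usage)   # over-indented: skipped when there are no cells
    # fix: compute 'result' here, outside the loop
    return result

print(new_cell_usage([], 6, 2048))   # (0, 0) with the bug; should be (6, 2048)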

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New


** Tags: compute scheduler

** Changed in: nova
 Assignee: (unassigned) => Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1590607

Title:
  incorrect handling of host numa cell usage with instances having no
  numa topology

Status in OpenStack Compute (nova):
  New

Bug description:
  I think there is a problem in host NUMA node resource tracking when
  there is an instance with no numa topology on the same node as
  instances with numa topology.

  It's triggered while running the resource audit, which ultimately
  calls hardware.get_host_numa_usage_from_instance() and assigns the
  result to self.compute_node.numa_topology.

  The problem occurs if you have a number of instances with numa
  topology, and then an instance with no numa topology. When running
  numa_usage_from_instances() for the instance with no numa topology we
  cache the values of "memory_usage" and "cpu_usage". However, because
  instance.cells is empty we don't enter the loop. Since the two lines
  in this commit are indented too far they don't get called, and we end
  up appending a host cell with "cpu_usage" and "memory_usage" of zero.
  This results in a host numa_topology cell with incorrect "cpu_usage"
  and "memory_usage" values, though I think the overall host cpu/memory
  usage is still correct.

  The fix is to reduce the indentation of the two lines in question so
  that they get called even when the instance has no numa topology. This
  writes the original host cell usage information back to it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1590607/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1590133] [NEW] help text for cpu_allocation_ratio is wrong

2016-06-07 Thread Chris Friesen
Public bug reported:

In stable/mitaka in resource_tracker.py the help text for the
cpu_allocation_ratio config option reads in part:

 'NOTE: This can be set per-compute, or if set to 0.0, the value '
 'set on the scheduler node(s) will be used '
 'and defaulted to 16.0'),

However, there is no longer any value set on the scheduler node(s).
They use the per-compute-node value set in resource_tracker.py.

Instead, if the value is 0.0 then ComputeNode._from_db_object() will
convert the value to 16.0.  This ensures that the scheduler filters see
a value of 16.0 by default.

In Newton the plan appears to be to change the default value to an
explicit 16.0 (and presumably updating the help text) but that doesn't
help the already-released Mitaka code.
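
A minimal sketch of the substitution described above (illustrative, not the
actual ComputeNode._from_db_object() code):

def effective_cpu_allocation_ratio(configured):
    # 0.0 means "not set"; the object layer substitutes the 16.0 default
    return 16.0 if configured == 0.0 else configured

print(effective_cpu_allocation_ratio(0.0))   # 16.0
print(effective_cpu_allocation_ratio(4.0))   # 4.0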

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1590133

Title:
  help text for cpu_allocation_ratio is wrong

Status in OpenStack Compute (nova):
  New

Bug description:
  In stable/mitaka in resource_tracker.py the help text for the
  cpu_allocation_ratio config option reads in part:

   'NOTE: This can be set per-compute, or if set to 0.0, the value '
   'set on the scheduler node(s) will be used '
   'and defaulted to 16.0'),

  However, there is no longer any value set on the scheduler node(s).
  They use the per-compute-node value set in resource_tracker.py.

  Instead, if the value is 0.0 then ComputeNode._from_db_object() will
  convert the value to 16.0.  This ensures that the scheduler filters
  see a value of 16.0 by default.

  In Newton the plan appears to be to change the default value to an
  explicit 16.0 (and presumably updating the help text) but that doesn't
  help the already-released Mitaka code.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1590133/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1590091] [NEW] bug in handling of ISOLATE thread policy

2016-06-07 Thread Chris Friesen
Public bug reported:

I'm running stable/mitaka in devstack.  I've got a small system with 2
pCPUs, both marked as available for pinning.  They're two cores of a
single processor, no threads.  "virsh capabilities" shows:

  [XML cell/cpu listing stripped in the archive; it showed the two cores
  with no thread siblings]

It is my understanding that I should be able to boot up an instance with
two dedicated CPUs and a thread policy of ISOLATE, since I have two
physical cores and no threads.  (Is this correct?)

Unfortunately, the NUMATopology filter fails my host.  The problem is in
_pack_instance_onto_cores():

if (instance_cell.cpu_thread_policy ==
fields.CPUThreadAllocationPolicy.ISOLATE):
# make sure we have at least one fully free core
if threads_per_core not in sibling_sets:
return

pinning = _get_pinning(1,  # we only want to "use" one thread per core
   sibling_sets[threads_per_core],
   instance_cell.cpuset)


Right before the call to _get_pinning() we have the following:

(Pdb) instance_cell.cpu_thread_policy
u'isolate'
(Pdb) threads_per_core
1
(Pdb) sibling_sets 
defaultdict(<type 'list'>, {1: [CoercedSet([0, 1])], 2: [CoercedSet([0, 1])]})
(Pdb) sibling_sets[threads_per_core]
[CoercedSet([0, 1])]
(Pdb) instance_cell.cpuset
CoercedSet([0, 1])

In this code snippet, _get_pinning() returns None, causing the filter to
fail the host.  Tracing a bit further in, in _get_pinning() we have the
following line:

if threads_no * len(sibling_set) < len(instance_cores):
return

Coming into this line of code the variables look like this:

(Pdb) threads_no
1
(Pdb) sibling_set
[CoercedSet([0, 1])]
(Pdb) len(sibling_set)
1
(Pdb) instance_cores
CoercedSet([0, 1])
(Pdb) len(instance_cores)
2

So the test evaluates to True, and we bail out.

I don't think this is correct, we should be able to schedule on this
host.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1590091

Title:
  bug in handling of ISOLATE thread policy

Status in OpenStack Compute (nova):
  New

Bug description:
  I'm running stable/mitaka in devstack.  I've got a small system with 2
  pCPUs, both marked as available for pinning.  They're two cores of a
  single processor, no threads.  "virsh capabilities" shows:

    [XML cell/cpu listing stripped in the archive; it showed the two cores
    with no thread siblings]

  It is my understanding that I should be able to boot up an instance
  with two dedicated CPUs and a thread policy of ISOLATE, since I have
  two physical cores and no threads.  (Is this correct?)

  Unfortunately, the NUMATopology filter fails my host.  The problem is
  in _pack_instance_onto_cores():

  if (instance_cell.cpu_thread_policy ==
  fields.CPUThreadAllocationPolicy.ISOLATE):
  # make sure we have at least one fully free core
  if threads_per_core not in sibling_sets:
  return

  pinning = _get_pinning(1,  # we only want to "use" one thread per core
 sibling_sets[threads_per_core],
 instance_cell.cpuset)

  
  Right before the call to _get_pinning() we have the following:

  (Pdb) instance_cell.cpu_thread_policy
  u'isolate'
  (Pdb) threads_per_core
  1
  (Pdb) sibling_sets 
  defaultdict(<type 'list'>, {1: [CoercedSet([0, 1])], 2: [CoercedSet([0, 1])]})
  (Pdb) sibling_sets[threads_per_core]
  [CoercedSet([0, 1])]
  (Pdb) instance_cell.cpuset
  CoercedSet([0, 1])

  In this code snippet, _get_pinning() returns None, causing the filter
  to fail the host.  Tracing a bit further in, in _get_pinning() we have
  the following line:

  if threads_no * len(sibling_set) < len(instance_cores):
  return

  Coming into this line of code the variables look like this:

  (Pdb) threads_no
  1
  (Pdb) sibling_set
  [CoercedSet([0, 1])]
  (Pdb) len(sibling_set)
  1
  (Pdb) instance_cores
  CoercedSet([0, 1])
  (Pdb) len(instance_cores)
  2

  So the test evaluates to True, and we bail out.

  I don't think this is correct, we should be able to schedule on this
  host.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1590091/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1577642] [NEW] race between disk_available_least and instance operations

2016-05-02 Thread Chris Friesen
Public bug reported:

The calculation for LibvirtDriver._get_disk_over_committed_size_total()
loops over all the instances on the hypervisor to try to figure out the
total overcommitted size for all instances.

However, at the time that routine is called from
ResourceTracker.update_available_resource()  we do not hold
COMPUTE_RESOURCE_SEMAPHORE.  This means that instance claims can be
modified (due to instance creation/deletion/resize/migration/etc),
potentially causing the calculated value for
data['disk_available_least'] to not actually reflect current reality,
and potentially allowing different eventlets to have different views of
data['disk_available_least'].

There was a related bug reported some time back
(https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with
the underlying race condition they just sort of papered over it by
ignoring the InstanceNotFound exception.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute race-condition

** Description changed:

  The calculation for LibvirtDriver._get_disk_over_committed_size_total()
  loops over all the instances on the hypervisor to try to figure out the
  total overcommitted size for all instances.
  
  However, at the time that routine is called from
  ResourceTracker.update_available_resource()  we do not hold
  COMPUTE_RESOURCE_SEMAPHORE.  This means that instances can be
  created/destroyed/resized, causing the calculated value for
  data['disk_available_least'] to not actually reflect current reality.
+ 
+ There was a related bug reported some time back
+ (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with
+ the underlying race condition they just sort of papered over it by
+ ignoring the InstanceNotFound exception.

** Description changed:

  The calculation for LibvirtDriver._get_disk_over_committed_size_total()
  loops over all the instances on the hypervisor to try to figure out the
  total overcommitted size for all instances.
  
  However, at the time that routine is called from
  ResourceTracker.update_available_resource()  we do not hold
- COMPUTE_RESOURCE_SEMAPHORE.  This means that instances can be
- created/destroyed/resized, causing the calculated value for
- data['disk_available_least'] to not actually reflect current reality.
+ COMPUTE_RESOURCE_SEMAPHORE.  This means that instance claims can be
+ modified (due to instance creation/deletion/resize/migration/etc),
+ causing the calculated value for data['disk_available_least'] to not
+ actually reflect current reality.
  
  There was a related bug reported some time back
  (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with
  the underlying race condition they just sort of papered over it by
  ignoring the InstanceNotFound exception.

** Description changed:

  The calculation for LibvirtDriver._get_disk_over_committed_size_total()
  loops over all the instances on the hypervisor to try to figure out the
  total overcommitted size for all instances.
  
  However, at the time that routine is called from
  ResourceTracker.update_available_resource()  we do not hold
  COMPUTE_RESOURCE_SEMAPHORE.  This means that instance claims can be
  modified (due to instance creation/deletion/resize/migration/etc),
- causing the calculated value for data['disk_available_least'] to not
- actually reflect current reality.
+ potentially causing the calculated value for
+ data['disk_available_least'] to not actually reflect current reality,
+ and potentially allowing different eventlets to have different views of
+ data['disk_available_least'].
  
  There was a related bug reported some time back
  (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal with
  the underlying race condition they just sort of papered over it by
  ignoring the InstanceNotFound exception.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1577642

Title:
  race between disk_available_least and instance operations

Status in OpenStack Compute (nova):
  New

Bug description:
  The calculation for
  LibvirtDriver._get_disk_over_committed_size_total() loops over all the
  instances on the hypervisor to try to figure out the total
  overcommitted size for all instances.

  However, at the time that routine is called from
  ResourceTracker.update_available_resource()  we do not hold
  COMPUTE_RESOURCE_SEMAPHORE.  This means that instance claims can be
  modified (due to instance creation/deletion/resize/migration/etc),
  potentially causing the calculated value for
  data['disk_available_least'] to not actually reflect current reality,
  and potentially allowing different eventlets to have different views
  of data['disk_available_least'].

  There was a related bug reported some time back
  (https://bugs.launchpad.net/nova/+bug/968339) but rather than deal
  with the underlying race condition they just sort of papered over it by
  ignoring the InstanceNotFound exception.

[Yahoo-eng-team] [Bug 1552777] [NEW] resizing from flavor with swap to one without swap puts instance into Error status

2016-03-03 Thread Chris Friesen
Public bug reported:

In a single-node devstack (current trunk, nova commit 6e1051b7), if you
boot an instance with a flavor that has nonzero swap and then resize to
a flavor with zero swap it causes an exception.  It seems that we
somehow neglect to remove the swap file from the instance.

 2016-03-03 10:02:41.415 ERROR nova.virt.libvirt.guest 
[req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] Error launching a 
defined domain with XML: 
  [libvirt domain XML elided: the XML markup was stripped when this message
  was archived.  It described domain instance-0001
  (UUID 54711b56-fa72-4eac-a5d3-aa29ed128098), its memory and vCPU settings,
  the kernel/ramdisk under
  /opt/stack/data/nova/instances/54711b56-fa72-4eac-a5d3-aa29ed128098/,
  and its disk and network devices.]

2016-03-03 10:02:41.417 ERROR nova.compute.manager 
[req-dadee404-81c4-46de-9fd5-58de747b3b78 admin alt_demo] [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] Setting instance vm_state to ERROR
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] Traceback (most recent call last):
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3999, in finish_resize
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] disk_info, image_meta)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3964, in _finish_resize
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] old_instance_type)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in 
__exit__
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] six.reraise(self.type_, self.value, 
self.tb)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/compute/manager.py", line 3959, in _finish_resize
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] block_device_info, power_on)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 7202, in finish_migration
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] vifs_already_plugged=True)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 4862, in 
_create_domain_and_network
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] xml, pause=pause, power_on=power_on)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/driver.py", line 4793, in _create_domain
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] guest.launch(pause=pause)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/opt/stack/nova/nova/virt/libvirt/guest.py", line 142, in launch
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] self._encoded_xml, errors='ignore')
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 
"/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in 
__exit__
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098] six.reraise(self.type_, self.value, 
self.tb)
2016-03-03 10:02:41.417 TRACE nova.compute.manager [instance: 
54711b56-fa72-4eac-a5d3-aa29ed128098]   File 

[Yahoo-eng-team] [Bug 1549032] [NEW] max_net_count doesn't interact properly with min_count when booting multiple instances

2016-02-23 Thread Chris Friesen
Public bug reported:

In compute.api.API._create_instance() we have a min_count that is
optionally passed in by the end user as part of the boot request.

We calculate max_net_count based on networking constraints.

Currently we error out if max_net_count is zero, but we don't check it
against min_count.  If the end user specifies a min_count that is
greater than the calculated  max_net_count the resulting error isn't
very useful.

We know that min_count is guaranteed to be at least 1, so we can replace
the existing test against zero with one against min_count.  Doing this
gives a much more reasonable error message:

controller-0:~$ nova boot --image myimage --flavor simple --min-count 2 
--max-count 3 test
ERROR (Forbidden): Maximum number of ports exceeded (HTTP 403) (Request-ID: 
req-f7ff28bf-5708-4cbf-a634-2e9686afd970)
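
A hedged sketch of the proposed check (the helper name and RuntimeError are
illustrative; nova raises its own port-limit exception):

def check_net_count(max_net_count, min_count):
    # old check: error only when max_net_count == 0
    # proposed: compare against the user-requested minimum instead
    if max_net_count < min_count:
        raise RuntimeError("Maximum number of ports exceeded")
    return max_net_count

check_net_count(3, 2)          # enough ports, carries on
try:
    check_net_count(1, 2)      # fewer ports than --min-count
except RuntimeError as exc:
    print(exc)                 # Maximum number of ports exceeded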

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1549032

Title:
  max_net_count doesn't interact properly with min_count when booting
  multiple instances

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  In compute.api.API._create_instance() we have a min_count that is
  optionally passed in by the end user as part of the boot request.

  We calculate max_net_count based on networking constraints.

  Currently we error out if max_net_count is zero, but we don't check it
  against min_count.  If the end user specifies a min_count that is
  greater than the calculated  max_net_count the resulting error isn't
  very useful.

  We know that min_count is guaranteed to be at least 1, so we can
  replace the existing test against zero with one against min_count.
  Doing this gives a much more reasonable error message:

  controller-0:~$ nova boot --image myimage --flavor simple --min-count 2 
--max-count 3 test
  ERROR (Forbidden): Maximum number of ports exceeded (HTTP 403) (Request-ID: 
req-f7ff28bf-5708-4cbf-a634-2e9686afd970)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1549032/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1542039] [NEW] nova should not reschedule an instance that has already been deleted

2016-02-04 Thread Chris Friesen
Public bug reported:

I'm investigating an issue where an instance with a large disk and an
attached cinder volume was booted in a stable/kilo OpenStack setup with
the diskFilter disabled.

The timeline looks like this:
scheduler picks initial compute node
nova attempts to boot it up on one compute node, it runs out of disk space and 
gets rescheduled
 scheduler picks another compute node
user requests instance deletion
user requests cinder volume deletion
nova attempts to boot it up on second compute node, it runs out of disk space 
and gets rescheduled
scheduler picks a third compute node
nova  attempts to boot it up on third compute node, runs into problems due to 
missing cinder volume


The issue I want to address in this bug is whether it makes sense to reschedule 
the instance when the instance has already been deleted.

Also, instance deletion sets the task_state to 'deleting' early on.  In
compute.manager.ComputeManager._do_build_and_run_instance(), if we
decide to reschedule then nova-compute will set the task_state to
'scheduling' and then save the instance, which I think could overwrite
the 'deleting' state in the DB.

So...would it make sense to have nova-compute put an
"expected_task_state" on the instance.save() call that sets the
'scheduling' task_state?
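
A self-contained toy of the compare-and-swap semantics suggested above;
instance.save() does accept expected_task_state, but the class and state
names here are purely illustrative:

class Instance(object):
    def __init__(self, task_state):
        self.task_state = task_state
        self._db_state = task_state

    def save(self, task_state, expected_task_state):
        if self._db_state not in expected_task_state:
            raise RuntimeError("UnexpectedTaskStateError")
        self._db_state = self.task_state = task_state

inst = Instance("deleting")            # user deleted mid-reschedule
try:
    inst.save("scheduling", expected_task_state=("spawning", None))
except RuntimeError:
    print("reschedule aborted; 'deleting' is preserved")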

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1542039

Title:
  nova should not reschedule an instance that has already been deleted

Status in OpenStack Compute (nova):
  New

Bug description:
  I'm investigating an issue where an instance with a large disk and an
  attached cinder volume was booted in a stable/kilo OpenStack setup
  with the diskFilter disabled.

  The timeline looks like this:
  scheduler picks initial compute node
  nova attempts to boot it up on one compute node, it runs out of disk space 
and gets rescheduled
   scheduler picks another compute node
  user requests instance deletion
  user requests cinder volume deletion
  nova attempts to boot it up on second compute node, it runs out of disk space 
and gets rescheduled
  scheduler picks a third compute node
  nova  attempts to boot it up on third compute node, runs into problems due to 
missing cinder volume

  
  The issue I want to address in this bug is whether it makes sense to 
reschedule the instance when the instance has already been deleted.

  Also, instance deletion sets the task_state to 'deleting' early on.
  In compute.manager.ComputeManager._do_build_and_run_instance(), if we
  decide to reschedule then nova-compute will set the task_state to
  'scheduling' and then save the instance, which I think could overwrite
  the 'deleting' state in the DB.

  So...would it make sense to have nova-compute put an
  "expected_task_state" on the instance.save() call that sets the
  'scheduling' task_state?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1542039/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1538619] [NEW] Fix up argument order in remove_volume_connection()

2016-01-27 Thread Chris Friesen
Public bug reported:

The RPC API function for remove_volume_connection() uses a different argument 
order than the ComputeManager function of the same name.

The normal RPC code uses named arguments, but the _ComputeV4Proxy version 
doesn't, and it has the order wrong.  This causes problems when called by 
_rollback_live_migration().

The fix seems to be trivial:
diff --git a/nova/compute/manager.py b/nova/compute/manager.py
index d6efd18..65c1b75 100644
--- a/nova/compute/manager.py
+++ b/nova/compute/manager.py
@@ -6870,7 +6870,8 @@ class _ComputeV4Proxy(object):
   instance)
 
 def remove_volume_connection(self, ctxt, instance, volume_id):
-return self.manager.remove_volume_connection(ctxt, instance, volume_id)
+# The RPC API uses different argument order than the local API.
+return self.manager.remove_volume_connection(ctxt, volume_id, instance)
 
 def rescue_instance(self, ctxt, instance, rescue_password,
 rescue_image_ref, clean_shutdown):

Given that this only applies to stable/kilo I'm guessing there's no
point in trying to push a patch, but I thought I'd include this here in
case anyone else runs into it.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1538619

Title:
  Fix up argument order in remove_volume_connection()

Status in OpenStack Compute (nova):
  New

Bug description:
  The RPC API function for remove_volume_connection() uses a different argument 
order than the ComputeManager function of the same name.
  
  The normal RPC code uses named arguments, but the _ComputeV4Proxy version 
doesn't, and it has the order wrong.  This causes problems when called by 
_rollback_live_migration().

  The fix seems to be trivial:
  diff --git a/nova/compute/manager.py b/nova/compute/manager.py
  index d6efd18..65c1b75 100644
  --- a/nova/compute/manager.py
  +++ b/nova/compute/manager.py
  @@ -6870,7 +6870,8 @@ class _ComputeV4Proxy(object):
 instance)
   
   def remove_volume_connection(self, ctxt, instance, volume_id):
  -return self.manager.remove_volume_connection(ctxt, instance, 
volume_id)
  +# The RPC API uses different argument order than the local API.
  +return self.manager.remove_volume_connection(ctxt, volume_id, 
instance)
   
   def rescue_instance(self, ctxt, instance, rescue_password,
   rescue_image_ref, clean_shutdown):

  Given that this only applies to stable/kilo I'm guessing there's no
  point in trying to push a patch, but I thought I'd include this here
  in case anyone else runs into it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1538619/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1536703] [NEW] unable to re-issue confirm/revert of resize

2016-01-21 Thread Chris Friesen
Public bug reported:

Calling confirm_resize() sets migration.status to 'confirming' and sends an
RPC cast to the compute node.

If there's a glitch and that cast is received but never processed,
there's no way to confirm the resize since it only looks for migrations
with a status of "finished".  It looks like it should be safe as-is to
allow calling confirm_resize on a migration in the "confirming" state
since it's already synchronized on the instance.

A similar problem holds for an interrupted revert_resize(), but in that
case there's no synchronization currently.  Not sure if that's a problem
or not.
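
A minimal sketch of the idea above (the tuple and helper are illustrative;
the actual API code looks migrations up by status rather than testing them
like this):

ALLOWED_FOR_CONFIRM = ('finished', 'confirming')

def can_confirm(migration_status):
    # today only 'finished' is accepted; also accepting 'confirming'
    # would let a lost RPC cast be re-issued safely
    return migration_status in ALLOWED_FOR_CONFIRM

print(can_confirm('confirming'))   # True under the proposed behaviour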

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1536703

Title:
  unable to re-issue confirm/revert of resize

Status in OpenStack Compute (nova):
  New

Bug description:
  Calling confirm_resize() sets migration.status to 'confirming' and sends an
  RPC cast to the compute node.

  If there's a glitch and that cast is received but never processed,
  there's no way to confirm the resize since it only looks for
  migrations with a status of "finished".  It looks like it should be
  safe as-is to allow calling confirm_resize on a migration in the
  "confirming" state since it's already synchronized on the instance.

  A similar problem holds for an interrupted revert_resize(), but in
  that case there's no synchronization currently.  Not sure if that's a
  problem or not.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1536703/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1528325] [NEW] instance with explicit "small" pages treated different from implicit

2015-12-21 Thread Chris Friesen
Public bug reported:

In numa_get_constraints() we call

pagesize = _numa_get_pagesize_constraints(flavor, image_meta)

then later we have

if nodes or pagesize:

[setattr(c, 'pagesize', pagesize) for c in numa_topology.cells]


This ends up treating an instance which doesn't specify pagesize (which results 
in 4K pages) differently from an instance that explicitly specifies 4K pages.  
In the first case the instance may not have a numa topology specified, while in 
the second case it does.

In _get_guest_numa_config() we check whether the guest has a numa
topology, and if it does we restrict it to a single NUMA node rather
than letting it float across the whole host.  This unexpectedly results
in different CPU and memory affinity depending on whether an instance
implicitly assumes 4K pages or explicitly specifies them.
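
For illustration, the two flavor configurations being compared above differ
only in whether the page-size extra spec is present (example values):

implicit_small = {}                              # no pagesize key -> may get no NUMA topology
explicit_small = {"hw:mem_page_size": "small"}   # pagesize set -> a NUMA topology is created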

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute numa

** Summary changed:

- explicit "small" pages treated different from implicit
+ instance with explicit "small" pages treated different from implicit

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1528325

Title:
  instance with explicit "small" pages treated different from implicit

Status in OpenStack Compute (nova):
  New

Bug description:
  In numa_get_constraints() we call

  pagesize = _numa_get_pagesize_constraints(flavor, image_meta)

  then later we have

  if nodes or pagesize:
  
  [setattr(c, 'pagesize', pagesize) for c in numa_topology.cells]

  
  This ends up treating an instance which doesn't specify pagesize (which 
results in 4K pages) differently from an instance that explicitly specifies 4K 
pages.  In the first case the instance may not have a numa topology specified, 
while in the second case it does.

  In _get_guest_numa_config() we check whether the guest has a numa
  topology, and if it does we restrict it to a single NUMA node rather
  than letting it float across the whole host.  This unexpectedly
  results in different CPU and memory affinity depending on whether an
  instance implicitly assumes 4K pages or explicitly specifies them.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1528325/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1512907] [NEW] leak of vswitch port if delete an instance while resizing

2015-11-03 Thread Chris Friesen
Public bug reported:

I've been testing with a modified version of stable/kilo, but I believe
the bug is present in upstream stable/kilo.

When using nova with neutron, if I boot an instance, then trigger a
resize, and then delete the instance at just the right point during the
resize it ends up causing a vswitch port to be "leaked".

So far I've been able to show that if I issue the "nova delete" command
while the resize operation is anywhere in
nova.compute.manager.ComputeManager._finish_resize() up to the point
where we set "instance.vm_state = vm_states.RESIZED" then I end up
"leaking" a vswitch port.  (By "leaking" I mean that it stays allocated
even after the instance that it was allocated for is deleted.)  I've
been testing this by calling pdb.set_trace() to pause the resize while
the nova delete runs, then letting the resize continue.  Yes, this
exaggerates the timing issues, but it shouldn't introduce any new races
if the code is correct.

I think the problem occurs because the deletion path can't confirm the
migration/resize because it hasn't gotten to the proper state yet.  The
resize code takes various exceptions depending on the exact timing of
when the deletion happens, but it doesn't trigger a revert of the resize
and it doesn't clean up the vswitch port on the source host.  See sample
log below.

I'm not sure what the proper fix is for this case.  It seems that until
we set "instance.vm_state = vm_states.RESIZED" it should be up to the
resize code to clean up all resources if the instance gets deleted while
a resize is in progress.


Sample log on source compute node.  This is with a pause right at the beginning 
of _finish_resize().

(Pdb) c
2015-11-03 23:28:37.968 17000 INFO nova.compute.resource_tracker 
[req-2d9812a5-eadf-4eb2-96a3-d46e496b292d - - - - -] Auditing locally available 
compute resources for node compute-1
2015-11-03 23:28:38.511 17000 INFO nova.network.neutronv2.api 
[req-46519d0e-6e80-40e1-bbb3-dc39184eb046 41f42dfc41f9428fb143623f0a83d2fa 
726f4a1ce23f4f12acb9139dcfcdb313 - - -] [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557] Port dad525cf-75af-47a7-a57c-cc6fa26a6cf2 
from network info_cache is no longer associated with instance in Neutron. 
Removing from network info_cache.
2015-11-03 23:28:38.677 17000 ERROR nova.compute.manager 
[req-46519d0e-6e80-40e1-bbb3-dc39184eb046 41f42dfc41f9428fb143623f0a83d2fa 
726f4a1ce23f4f12acb9139dcfcdb313 - - -] [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557] Setting instance vm_state to ERROR
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557] Traceback (most recent call last):
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"/usr/lib64/python2.7/site-packages/nova/compute/manager.py", line 4350, in 
finish_resize
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557] disk_info, image)
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"/usr/lib64/python2.7/site-packages/nova/compute/manager.py", line 4247, in 
_finish_resize
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557] resize_instance = False
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7866, 
in macs_for_instance
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/pci/manager.py", line 380, in 
get_instance_pci_devs
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/objects/base.py", line 72, in getter
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/objects/instance.py", line 1022, in 
obj_load_attr
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/objects/instance.py", line 904, in 
_load_generic
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/objects/base.py", line 161, in wrapper
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"./usr/lib64/python2.7/site-packages/nova/conductor/rpcapi.py", line 335, in 
object_class_action
2015-11-03 23:28:38.677 17000 TRACE nova.compute.manager [instance: 
21a4ff59-9057-4bc2-8b5e-185374c2d557]   File 
"/usr/lib64/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in 
call
2015-11-03 23:28:38.677 

[Yahoo-eng-team] [Bug 1471997] Re: nova MAX_FUNC value in nova/pci/devspec.py is too low

2015-08-20 Thread Chris Friesen
Jay Pipes helpfully pointed out that the MAX_FUNC value was defined by
the PCI spec, and didn't refer to the SRIOV VF value, but rather the PCI
device function.
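
For context, a PCI address has the form domain:bus:slot.function and the
function field is only three bits wide, so 0x7 is the PCI-spec ceiling for
that field rather than a limit on VF counts.  A rough illustration of the
parsing (not the devspec.py code):

    def parse_pci_address(address):
        """Split 'dddd:bb:ss.f' into numeric (domain, bus, slot, function)."""
        rest, func = address.rsplit('.', 1)
        domain, bus, slot = rest.split(':')
        return int(domain, 16), int(bus, 16), int(slot, 16), int(func, 16)

    print(parse_pci_address('0000:81:10.7'))   # (0, 129, 16, 7) -- largest legal function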

The original issue turned out to be a local problem generating the PCI
whitelist.

** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1471997

Title:
  nova MAX_FUNC value in nova/pci/devspec.py is too low

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  The MAX_FUNC value in nova/pci/devspec.py is set to 0x7.  This limits
  us to a relatively small number of VFs per PF, which is annoying when
  trying to use SRIOV in any sort of serious way.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1471997/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1484742] Re: NUMATopologyFilter doesn't account for CPU/RAM overcommit

2015-08-17 Thread Chris Friesen
** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1484742

Title:
  NUMATopologyFilter doesn't account for CPU/RAM overcommit

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  There seems to be a bug in the NUMATopologyFilter where it doesn't
  properly account for cpu_allocation_ratio or ram_allocation_ratio.
  (Detected on stable/kilo, not sure if it applies to current master.)

  To reproduce:

  1) Create a flavor with a moderate number of CPUs (5, for example) and
  enable hugepages by setting   hw:mem_page_size=2048 in the flavor
  extra specs.  Do not specify dedicated CPUs on the flavor.

  2) Ensure that the available compute nodes have fewer CPUs free than
  the number of CPUs in the flavor above.

  3) Ensure that the cpu_allocation_ratio is big enough that
  num_free_cpus * cpu_allocation_ratio is more than the number of CPUs
  in the flavor above.

  4) Enable the NUMATopologyFilter for the nova filter scheduler.

  5) Try to boot an instance with the specified flavor.

  This should pass, because we're not using dedicated CPUs and so the
  cpu_allocation_ratio should apply.  However, the NUMATopologyFilter
  returns 0 hosts.

  It seems like the NUMATopologyFilter is failing to properly account
  for the cpu_allocation_ratio when checking whether an instance can fit
  onto a given host.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1484742/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1485631] [NEW] CPU/RAM overcommit treated differently by normal and NUMA topology case

2015-08-17 Thread Chris Friesen
Public bug reported:

Currently in the NUMA topology case (so multi-node guest, dedicated
CPUs, hugepages in the guest, etc.) a single guest is not allowed to
consume more CPU/RAM than the host actually has in total regardless of
the specified overcommit ratio.  In other words, the overcommit ratio
only applies when the host resources are being used by multiple guests.
A given host resource can only be used once by any particular guest.

So as an example, if the host has 2 pCPUs in total for guests, a single
guest instance is not allowed to use more than 2CPUs but you might be
able to have 16 such instances running. (Assuming default CPU overcommit
ratio.)

However, this is not true when the NUMA topology is not involved.  In
that case a host with 2 pCPUs would allow a guest with 3 vCPUs to be
spawned.

We should pick one behaviour as correct and adjust the other one to
match.  Given that the NUMA topology case was discussed more recently,
it seems reasonable to select it as the correct behaviour.
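
A toy comparison of the two behaviours (not nova code), using the 2-pCPU /
3-vCPU example above and nova's default CPU overcommit ratio of 16.0:

    def fits_without_numa(host_pcpus, guest_vcpus, ratio=16.0):
        # non-NUMA path: the allocation ratio applies even to a single guest
        return guest_vcpus <= host_pcpus * ratio

    def fits_with_numa(host_pcpus, guest_vcpus):
        # NUMA-topology path: a single guest may never exceed the host total
        return guest_vcpus <= host_pcpus

    print(fits_without_numa(2, 3))   # True  -- 3 vCPUs allowed on 2 pCPUs
    print(fits_with_numa(2, 3))      # False -- rejected regardless of ratio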

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1485631

Title:
  CPU/RAM overcommit treated differently by normal and NUMA topology
  case

Status in OpenStack Compute (nova):
  New

Bug description:
  Currently in the NUMA topology case (so multi-node guest, dedicated
  CPUs, hugepages in the guest, etc.) a single guest is not allowed to
  consume more CPU/RAM than the host actually has in total regardless of
  the specified overcommit ratio.  In other words, the overcommit ratio
  only applies when the host resources are being used by multiple
  guests.  A given host resource can only be used once by any particular
  guest.

  So as an example, if the host has 2 pCPUs in total for guests, a
  single guest instance is not allowed to use more than 2CPUs but you
  might be able to have 16 such instances running. (Assuming default CPU
  overcommit ratio.)

  However, this is not true when the NUMA topology is not involved.  In
  that case a host with 2 pCPUs would allow a guest with 3 vCPUs to be
  spawned.

  We should pick one behaviour as correct and adjust the other one to
  match.  Given that the NUMA topology case was discussed more recently,
  it seems reasonable to select it as the correct behaviour.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1485631/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1484742] [NEW] NUMATopologyFilter doesn't account for cpu_allocation_ratio

2015-08-13 Thread Chris Friesen
Public bug reported:

There seems to be a bug in the NUMATopologyFilter where it doesn't
properly account for cpu_allocation_ratio.  (Detected on stable/kilo,
not sure if it applies to current master.)

To reproduce:

1) Create a flavor with a moderate number of CPUs (5, for example) and
enable hugepages by setting   hw:mem_page_size=2048 in the flavor
extra specs.  Do not specify dedicated CPUs on the flavor.

2) Ensure that the available compute nodes have fewer CPUs free than the
number of CPUs in the flavor above.

3) Ensure that the cpu_allocation_ratio is big enough that
num_free_cpus * cpu_allocation_ratio is more than the number of CPUs
in the flavor above.

4) Enable the NUMATopologyFilter for the nova filter scheduler.

5) Try to boot an instance with the specified flavor.

This should pass, because we're not using dedicated CPUs and so the
cpu_allocation_ratio should apply.  However, the NUMATopologyFilter
returns 0 hosts.

It seems like the NUMATopologyFilter is failing to properly account for
the cpu_allocation_ratio when checking whether an instance can fit onto
a given host.
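
A toy sketch of the check that would be expected here (not the real filter
code):

    def host_passes(free_pcpus, requested_vcpus, cpu_allocation_ratio,
                    dedicated=False):
        limit = free_pcpus if dedicated else free_pcpus * cpu_allocation_ratio
        return requested_vcpus <= limit

    # 5-vCPU flavor, host with only 4 free pCPUs, default ratio of 16.0:
    print(host_passes(4, 5, 16.0))                   # True is the expected result
    print(host_passes(4, 5, 16.0, dedicated=True))   # False only for pinned CPUs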

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute scheduler

** Description changed:

  There seems to be a bug in the NUMATopologyFilter where it doesn't
- properly account for cpu_allocation_ratio.
+ properly account for cpu_allocation_ratio.  (Detected on stable/kilo,
+ not sure if it applies to current master.)
  
  To reproduce:
  
  1) Create a flavor with a moderate number of CPUs (5, for example) and
  enable hugepages by setting   hw:mem_page_size=2048 in the flavor
  extra specs.  Do not specify dedicated CPUs on the flavor.
  
  2) Ensure that the available compute nodes have fewer CPUs free than the
  number of CPUs in the flavor above.
  
  3) Ensure that the cpu_allocation_ratio is big enough that
  num_free_cpus * cpu_allocation_ratio is more than the number of CPUs
  in the flavor above.
  
  4) Enable the NUMATopologyFilter for the nova filter scheduler.
  
  5) Try to boot an instance with the specified flavor.
  
  This should pass, because we're not using dedicated CPUs and so the
  cpu_allocation_ratio should apply.  However, the NUMATopologyFilter
  returns 0 hosts.
  
  It seems like the NUMATopologyFilter is failing to properly account for
  the cpu_allocation_ratio when checking whether an instance can fit onto
  a given host.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1484742

Title:
  NUMATopologyFilter doesn't account for cpu_allocation_ratio

Status in OpenStack Compute (nova):
  New

Bug description:
  There seems to be a bug in the NUMATopologyFilter where it doesn't
  properly account for cpu_allocation_ratio.  (Detected on stable/kilo,
  not sure if it applies to current master.)

  To reproduce:

  1) Create a flavor with a moderate number of CPUs (5, for example) and
  enable hugepages by setting   hw:mem_page_size=2048 in the flavor
  extra specs.  Do not specify dedicated CPUs on the flavor.

  2) Ensure that the available compute nodes have fewer CPUs free than
  the number of CPUs in the flavor above.

  3) Ensure that the cpu_allocation_ratio is big enough that
  num_free_cpus * cpu_allocation_ratio is more than the number of CPUs
  in the flavor above.

  4) Enable the NUMATopologyFilter for the nova filter scheduler.

  5) Try to boot an instance with the specified flavor.

  This should pass, because we're not using dedicated CPUs and so the
  cpu_allocation_ratio should apply.  However, the NUMATopologyFilter
  returns 0 hosts.

  It seems like the NUMATopologyFilter is failing to properly account
  for the cpu_allocation_ratio when checking whether an instance can fit
  onto a given host.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1484742/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1482416] [NEW] bug blocks DB migration that changes column type

2015-08-06 Thread Chris Friesen
Public bug reported:

I'm trying to make the following change as a DB migration

+# Table instances, modify field 'vcpus_used' to Float
+compute_nodes = Table('compute_nodes', meta, autoload=True)
+vcpus_used = getattr(compute_nodes.c, 'vcpus_used')
+vcpus_used.alter(type=Float)


This works at runtime (using PostgreSQL) but when running the unit tests (using 
sqlite) I get the following:


nova.tests.unit.db.test_migrations.TestNovaMigrationsSQLite.test_models_sync


Captured traceback:
~~~
Traceback (most recent call last):
  File 
/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_migrations.py,
 line 588, in test_models_sync
Models and migration scripts aren't in sync:\n%s % msg)
  File 
/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/unittest2/case.py,
 line 690, in fail
raise self.failureException(msg)
AssertionError: Models and migration scripts aren't in sync:
[ ( 'add_constraint',
UniqueConstraint(Column('host', String(length=255), 
table=compute_nodes), Column('hypervisor_hostname', String(length=255), 
table=compute_nodes), Column('deleted', Integer(), table=compute_nodes, 
default=ColumnDefault(0]



This appears to be an interaction between two things, the change I made to 
alter the vcpus_used column, and a previous change (commit 2db4a1ac, migration 
version 279) to add the uniq_compute_nodes0host0hypervisor_hostname constraint. 
 sqlite doesn't support altering columns, so there's a workaround that makes a 
new column, copies the contents over, and deletes the old one...this seems to 
be running into problems with the modified constraint.

I suspect that this means that anyone wanting to change the type of a
column in the compute_nodes table will run into a similar problem.
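
For reference, the change above would normally live in a standalone
sqlalchemy-migrate script along these lines (a sketch following nova's usual
migration layout; the file name/number is omitted):

    from sqlalchemy import Float, MetaData, Table

    def upgrade(migrate_engine):
        meta = MetaData()
        meta.bind = migrate_engine
        compute_nodes = Table('compute_nodes', meta, autoload=True)
        compute_nodes.c.vcpus_used.alter(type=Float)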

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute db

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1482416

Title:
  bug blocks DB migration that changes column type

Status in OpenStack Compute (nova):
  New

Bug description:
  I'm trying to make the following change as a DB migration

  +# Table instances, modify field 'vcpus_used' to Float
  +compute_nodes = Table('compute_nodes', meta, autoload=True)
  +vcpus_used = getattr(compute_nodes.c, 'vcpus_used')
  +vcpus_used.alter(type=Float)

  
  This works at runtime (using PostgreSQL) but when running the unit tests 
(using sqlite) I get the following:

  
  nova.tests.unit.db.test_migrations.TestNovaMigrationsSQLite.test_models_sync
  

  Captured traceback:
  ~~~
  Traceback (most recent call last):
File 
/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/oslo_db/sqlalchemy/test_migrations.py,
 line 588, in test_models_sync
  Models and migration scripts aren't in sync:\n%s % msg)
File 
/home/cfriesen/devel/wrlinux-x/addons/wr-cgcs/layers/cgcs/git/nova/.tox/py27/lib/python2.7/site-packages/unittest2/case.py,
 line 690, in fail
  raise self.failureException(msg)
  AssertionError: Models and migration scripts aren't in sync:
  [ ( 'add_constraint',
  UniqueConstraint(Column('host', String(length=255), 
table=compute_nodes), Column('hypervisor_hostname', String(length=255), 
table=compute_nodes), Column('deleted', Integer(), table=compute_nodes, 
default=ColumnDefault(0]
  

  
  This appears to be an interaction between two things, the change I made to 
alter the vcpus_used column, and a previous change (commit 2db4a1ac, migration 
version 279) to add the uniq_compute_nodes0host0hypervisor_hostname constraint. 
 sqlite doesn't support altering columns, so there's a workaround that makes a 
new column, copies the contents over, and deletes the old one...this seems to 
be running into problems with the modified constraint.

  I suspect that this means that anyone wanting to change the type of a
  column in the compute_nodes table will run into a similar problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1482416/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1471997] [NEW] nova MAX_FUNC value in nova/pci/devspec.py is too low

2015-07-06 Thread Chris Friesen
Public bug reported:

The MAX_FUNC value in nova/pci/devspec.py is set to 0x7.  This limits us
to a relatively small number of VFs per PF, which is annoying when
trying to use SRIOV in any sort of serious way.

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New


** Tags: compute

** Tags added: compute

** Changed in: nova
 Assignee: (unassigned) => Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1471997

Title:
  nova MAX_FUNC value in nova/pci/devspec.py is too low

Status in OpenStack Compute (Nova):
  New

Bug description:
  The MAX_FUNC value in nova/pci/devspec.py is set to 0x7.  This limits
  us to a relatively small number of VFs per PF, which is annoying when
  trying to use SRIOV in any sort of serious way.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1471997/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1461678] [NEW] nova error handling causes glance to keep unlinked files open, wasting space

2015-06-03 Thread Chris Friesen
Public bug reported:

When creating larger glance images (like a 10GB CentOS7 image), if we
run into a situation where we run out of room on the destination device,
we cannot recover the space from glance. glance-api will have open
unlinked files, so a TONNE of space is unavailable until we restart
glance-api.

Nova will try to reschedule the instance 3 times, so you should see this in
nova-conductor.log:
u'RescheduledException: Build of instance 98ca2c0d-44b2-48a6-b1af-55f4b2db73c1 
was re-scheduled: [Errno 28] No space left on device\n']

The problem is this code in
nova.image.glance.GlanceImageService.download():

    if data is None:
        return image_chunks
    else:
        try:
            for chunk in image_chunks:
                data.write(chunk)
        finally:
            if close_file:
                data.close()

image_chunks is an iterator.  If we take an exception (like we can't
write the file because the filesystem is full) then we will stop
iterating over the chunks.  If we don't iterate over all the chunks then
glance will keep the file open.
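
One possible mitigation (a sketch only, not necessarily the upstream fix) is
to make sure the iterator is always exhausted even when the local write
fails, so glance-api can close its end of the transfer:

    # _safe_download is a hypothetical helper wrapping the loop quoted above.
    def _safe_download(image_chunks, data, close_file):
        try:
            for chunk in image_chunks:
                data.write(chunk)
        except Exception:
            for _unused in image_chunks:   # drain the remaining chunks
                pass
            raise
        finally:
            if close_file:
                data.close()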

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New


** Tags: compute

** Changed in: nova
 Assignee: (unassigned) => Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1461678

Title:
  nova error handling causes glance to keep unlinked files open, wasting
  space

Status in OpenStack Compute (Nova):
  New

Bug description:
  When creating larger glance images (like a 10GB CentOS7 image), if we
  run into a situation where we run out of room on the destination device,
  we cannot recover the space from glance. glance-api will have open
  unlinked files, so a TONNE of space is unavailable until we restart
  glance-api.

  Nova will try to reschedule the instance 3 times, so you should see this in
nova-conductor.log:
  u'RescheduledException: Build of instance 
98ca2c0d-44b2-48a6-b1af-55f4b2db73c1 was re-scheduled: [Errno 28] No space left 
on device\n']

  The problem is this code in
  nova.image.glance.GlanceImageService.download():

      if data is None:
          return image_chunks
      else:
          try:
              for chunk in image_chunks:
                  data.write(chunk)
          finally:
              if close_file:
                  data.close()

  image_chunks is an iterator.  If we take an exception (like we can't
  write the file because the filesystem is full) then we will stop
  iterating over the chunks.  If we don't iterate over all the chunks
  then glance will keep the file open.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1461678/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1459782] [NEW] _is_storage_shared_with() in libvirt/driver.py gives possibly false results if ssh keys not configured

2015-05-28 Thread Chris Friesen
Public bug reported:

In virt.libvirt.driver.LibvirtDriver._is_storage_shared_with() we first
check IP addresses and if they don't match then we'll try to use ssh to
check whether the storage is actually shared or not.

If ssh keys are not set up between the compute nodes for the user
running nova-compute then the call to utils.ssh_execute() will fail and
we will return wrong information.

utils.ssh_execute() is also used in _cleanup_remote_migration() and
migrate_disk_and_power_off() and would suffer from similar issues there.

Either we need to ensure that the requirement for pre-sharing the ssh
keys is clearly documented, or we need to convert these to use RPC
calls like live migration.
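
For reference, the check in question is roughly of this shape (simplified
sketch; the real method also compares host IPs first, and run_remote() is a
stand-in for utils.ssh_execute()):

    import os
    import uuid

    def is_storage_shared_with(dest_host, instance_path, run_remote):
        token = os.path.join(instance_path, 'tmp-%s.tmp' % uuid.uuid4().hex)
        try:
            run_remote(dest_host, 'touch %s' % token)   # create marker on dest
            return os.path.exists(token)                # visible locally => shared
        except Exception:
            # A missing ssh key raises here, which is indistinguishable from
            # "storage is not shared" -- the problem described above.
            return False
        finally:
            try:
                run_remote(dest_host, 'rm -f %s' % token)
            except Exception:
                pass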

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1459782

Title:
  _is_storage_shared_with() in libvirt/driver.py gives possibly false
  results if ssh keys not configured

Status in OpenStack Compute (Nova):
  New

Bug description:
  In virt.libvirt.driver.LibvirtDriver._is_storage_shared_with() we
  first check IP addresses and if they don't match then we'll try to use
  ssh to check whether the storage is actually shared or not.

  If ssh keys are not set up between the compute nodes for the user
  running nova-compute then the call to utils.ssh_execute() will fail
  and we will return wrong information.

  utils.ssh_execute() is also used in _cleanup_remote_migration() and
  migrate_disk_and_power_off() and would suffer from similar issues
  there.

  Either we need to ensure that the requirement for pre-sharing the ssh
  keys is clearly documented, or we need to convert these to use RPC
  calls like live migration.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1459782/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1458122] [NEW] nova shouldn't error if we can't schedule all of max_count instances at boot time

2015-05-22 Thread Chris Friesen
Public bug reported:

When booting up instances, nova allows the user to specify a min count
and a max count.

Currently, if the user has quota space for max count instances, then
nova will try to create them all.  If any of them can't be scheduled,
then the creation of all of them will be aborted and they will all be
put into an error state.

Arguably, if nova was able to schedule at least min count instances
(which defaults to 1) then it should continue on with creating those
instances that it was able to schedule.  Only if nova cannot create at
least min count instances should nova actually consider the request as
failed.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1458122

Title:
  nova shouldn't error if we can't schedule all of max_count instances
  at boot time

Status in OpenStack Compute (Nova):
  New

Bug description:
  When booting up instances, nova allows the user to specify a min
  count and a max count.

  Currently, if the user has quota space for max count instances, then
  nova will try to create them all.  If any of them can't be scheduled,
  then the creation of all of them will be aborted and they will all be
  put into an error state.

  Arguably, if nova was able to schedule at least min count instances
  (which defaults to 1) then it should continue on with creating those
  instances that it was able to schedule.  Only if nova cannot create at
  least min count instances should nova actually consider the request
  as failed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1458122/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1454451] [NEW] simultaneous boot of multiple instances leads to cpu pinning overlap

2015-05-12 Thread Chris Friesen
Public bug reported:

I'm running into an issue with kilo-3 that I think is present in current
trunk.

I think there is a race between the claimed CPUs of an instance being
persisted to the DB, and the resource audit scanning the DB for
instances and subtracting pinned CPUs from the list of available CPUs.

The problem only shows up when the following sequence happens:
1) instance A (with dedicated cpus) boots on a compute node
2) resource audit runs on that compute node
3) instance B (with dedicated cpus) boots on the same compute node

So you need to either be booting many instances, or limiting the valid
compute nodes (host aggregate or server groups), or have a small cluster
in order to hit this.


The nitty-gritty view looks like this:

When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in
compute.resource_tracker.ResourceTracker.instance_claim() and that
covers updating the resource usage on the compute node. But we don't
persist the instance numa topology to the database until after
instance_claim() returns, in
compute.manager.ComputeManager._build_instance().  Note that this is
done *after* we've given up the semaphore, so there is no longer any
sort of ordering guarantee.

compute.resource_tracker.ResourceTracker.update_available_resource()
then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a list
of instances and uses that to calculate a new view of what resources are
available. If the numa topology of the most recent instance hasn't been
persisted yet, then the new view of resources won't include any pCPUs
pinned by that instance.

compute.manager.ComputeManager._build_instance() runs for the next
instance and based on the new view of available resources it allocates
the same pCPU(s) used by the earlier instance. Boom, overlapping pinned
pCPUs.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1454451

Title:
  simultaneous boot of multiple instances leads to cpu pinning overlap

Status in OpenStack Compute (Nova):
  New

Bug description:
  I'm running into an issue with kilo-3 that I think is present in
  current trunk.

  I think there is a race between the claimed CPUs of an instance being
  persisted to the DB, and the resource audit scanning the DB for
  instances and subtracting pinned CPUs from the list of available CPUs.

  The problem only shows up when the following sequence happens:
  1) instance A (with dedicated cpus) boots on a compute node
  2) resource audit runs on that compute node
  3) instance B (with dedicated cpus) boots on the same compute node

  So you need to either be booting many instances, or limiting the valid
  compute nodes (host aggregate or server groups), or have a small
  cluster in order to hit this.

  
  The nitty-gritty view looks like this:

  When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in
  compute.resource_tracker.ResourceTracker.instance_claim() and that
  covers updating the resource usage on the compute node. But we don't
  persist the instance numa topology to the database until after
  instance_claim() returns, in
  compute.manager.ComputeManager._build_instance().  Note that this is
  done *after* we've given up the semaphore, so there is no longer any
  sort of ordering guarantee.

  compute.resource_tracker.ResourceTracker.update_available_resource()
  then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a
  list of instances and uses that to calculate a new view of what
  resources are available. If the numa topology of the most recent
  instance hasn't been persisted yet, then the new view of resources
  won't include any pCPUs pinned by that instance.

  compute.manager.ComputeManager._build_instance() runs for the next
  instance and based on the new view of available resources it allocates
  the same pCPU(s) used by the earlier instance. Boom, overlapping
  pinned pCPUs.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1454451/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1444171] [NEW] evacuate code path is not updating binding:host_id in neutron

2015-04-14 Thread Chris Friesen
Public bug reported:

Currently when using neutron we don't update the binding:host_id during
the evacuate code path.

This can cause the evacuation to fail if we go to sleep waiting for
events in
virt.libvirt.driver.LibvirtDriver._create_domain_and_network().  Since
the binding:host_id in neutron is still pointing to the old host,
neutron will never send us any events and we will eventually time out.

I was able to get the evacuation to update the binding by modifying
compute.manager.ComputeManager.rebuild_instance() to add a call to
self.network_api.setup_instance_network_on_host() right below the
existing call to self.network_api.setup_networks_on_host().
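
Concretely, that workaround amounts to something like the following inside
rebuild_instance() (fragment only; signatures and placement are approximate
and may differ between releases):

    self.network_api.setup_networks_on_host(context, instance, self.host)
    # added: move the neutron port binding (binding:host_id) to this host so
    # that network-vif-plugged events are sent to the evacuation target
    self.network_api.setup_instance_network_on_host(context, instance,
                                                    self.host)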

I'm not sure this solution would work for nova network though.  It's a
bit confusing, currently the networking API has three routines that seem
to overlap:

setup_networks_on_host() -- this actually does setup or teardown, and is
empty for neutron

setup_instance_network_on_host() -- this maps to
self._update_port_binding_for_instance() for neutron.  For nova network
it maps to self.migrate_instance_finish() but that doesn't actually seem
to do anything.

cleanup_instance_network_on_host() -- this is empty for neutron.  It
maps to self.migrate_instance_start for nova network, but that doesn't
actually seem to do anything.

It seems like for neutron we use setup_instance_network_on_host() and
for nova-network we use setup_networks_on_host() and the rest are not
actually used.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute network

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1444171

Title:
  evacuate code path is not updating binding:host_id in neutron

Status in OpenStack Compute (Nova):
  New

Bug description:
  Currently when using neutron we don't update the binding:host_id
  during the evacuate code path.

  This can cause the evacuation to fail if we go to sleep waiting for
  events in
  virt.libvirt.driver.LibvirtDriver._create_domain_and_network().  Since
  the binding:host_id in neutron is still pointing to the old host,
  neutron will never send us any events and we will eventually time out.

  I was able to get the evacuation to update the binding by modifying
  compute.manager.ComputeManager.rebuild_instance() to add a call to
  self.network_api.setup_instance_network_on_host() right below the
  existing call to self.network_api.setup_networks_on_host().

  I'm not sure this solution would work for nova network though.  It's a
  bit confusing, currently the networking API has three routines that
  seem to overlap:

  setup_networks_on_host() -- this actually does setup or teardown, and
  is empty for neutron

  setup_instance_network_on_host() -- this maps to
  self._update_port_binding_for_instance() for neutron.  For nova
  network it maps to self.migrate_instance_finish() but that doesn't
  actually seem to do anything.

  cleanup_instance_network_on_host() -- this is empty for neutron.  It
  maps to self.migrate_instance_start for nova network, but that doesn't
  actually seem to do anything.

  It seems like for neutron we use setup_instance_network_on_host() and
  for nova-network we use setup_networks_on_host() and the rest are not
  actually used.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1444171/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1298513] Re: nova server group policy should be applied when resizing/migrating server

2015-03-23 Thread Chris Friesen
*** This bug is a duplicate of bug 1379451 ***
https://bugs.launchpad.net/bugs/1379451

** This bug has been marked a duplicate of bug 1379451
   anti-affinity policy only honored on boot

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1298513

Title:
  nova server group policy should be applied when resizing/migrating
  server

Status in OpenStack Compute (Nova):
  Confirmed

Bug description:
  If I do the following:

  nova server-group-create --policy affinity affinitygroup
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros0
  nova resize cirros0 2

  The cirros0 server will be resized but when the scheduler runs it
  doesn't take into account the scheduling policy of the server group.

  I think the same would be true if we migrate the server.

  Lastly, when doing migration/evacuation and the user has specified the
  compute node we might want to validate the choice against the group
  policy.  For emergencies we might want to allow policy violation with
  a --force option or something.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1298513/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1423648] [NEW] race conditions with server group scheduler policies

2015-02-19 Thread Chris Friesen
Public bug reported:

In git commit a79ecbe Russell Bryant submitted a partial fix for a race
condition when booting an instance as part of a server group with an
anti-affinity scheduler policy.

That fix only solves part of the problem, however.  There are a number
of issues remaining:

1) It's possible to hit a similar race condition for server groups with
the affinity policy.  Suppose we create a new group and then create
two instances simultaneously.  The scheduler sees an empty group for
each, assigns them to different compute nodes, and the policy is
violated.  We should add a check in _validate_instance_group_policy() to
cover the affinity case (see the sketch below).

2) It's possible to create two instances simultaneously, have them be
scheduled to conflicting hosts, both of them detect the problem in
_validate_instance_group_policy(), both of them get sent back for
rescheduling, and both of them get assigned to conflicting hosts
*again*, resulting in an error.  In order to fix this I propose that
instead of checking against all other instances in the group, we only
check against instances that were created before the current instance.
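
A rough sketch of the affinity-side check proposed in (1), mirroring the
existing anti-affinity check in _validate_instance_group_policy() (names and
signatures are approximate):

    group = objects.InstanceGroup.get_by_hint(context, group_hint)
    if 'affinity' in group.policies:
        group_hosts = group.get_hosts(exclude=[instance.uuid])
        if group_hosts and self.host not in group_hosts:
            msg = _("Affinity instance group policy was violated.")
            raise exception.RescheduledException(
                instance_uuid=instance.uuid, reason=msg)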

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1423648

Title:
  race conditions with server group scheduler policies

Status in OpenStack Compute (Nova):
  New

Bug description:
  In git commit a79ecbe Russell Bryant submitted a partial fix for a race
  condition when booting an instance as part of a server group with an
  anti-affinity scheduler policy.

  That fix only solves part of the problem, however.  There are a number
  of issues remaining:

  1) It's possible to hit a similar race condition for server groups
  with the affinity policy.  Suppose we create a new group and then
  create two instances simultaneously.  The scheduler sees an empty
  group for each, assigns them to different compute nodes, and the
  policy is violated.  We should add a check in
  _validate_instance_group_policy() to cover the affinity case.

  2) It's possible to create two instances simultaneously, have them be
  scheduled to conflicting hosts, both of them detect the problem in
  _validate_instance_group_policy(), both of them get sent back for
  rescheduling, and both of them get assigned to conflicting hosts
  *again*, resulting in an error.  In order to fix this I propose that
  instead of checking against all other instances in the group, we only
  check against instances that were created before the current instance.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1423648/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1420848] [NEW] nova-compute service spuriously marked as up when disabled

2015-02-11 Thread Chris Friesen
Public bug reported:

I think our usage of the updated_at field to determine whether a
service is up or not is buggy.  Consider this scenario:

1) nova-compute is happily running and is up/enabled on compute-0
2) something causes nova-compute to stop (process crash, hardware fault, power 
failure, network isolation, etc.)
3) a minute later, the nova-compute service is reported as down
4) I run nova service-disable compute-0 nova-compute
5) nova-compute is now reported as up for the next minute

I wonder if it would make sense to have a separate last status
timestamp database field that would only get updated when we get a
service status update and not when we change any other fields.
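
A toy illustration of why step 5 happens (simplified; not the exact
servicegroup driver code): the "is up" decision keys off updated_at, and the
service-disable in step 4 also bumps updated_at.

    import datetime

    SERVICE_DOWN_TIME = 60

    def service_is_up(updated_at, now):
        return (now - updated_at).total_seconds() <= SERVICE_DOWN_TIME

    now = datetime.datetime(2015, 2, 11, 12, 5, 0)
    crashed_at = datetime.datetime(2015, 2, 11, 12, 0, 0)    # heartbeats stopped
    disabled_at = datetime.datetime(2015, 2, 11, 12, 4, 30)  # service-disable ran

    print(service_is_up(crashed_at, now))    # False -- correctly reported down
    print(service_is_up(disabled_at, now))   # True  -- spuriously up again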

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1420848

Title:
  nova-compute service spuriously marked as up when disabled

Status in OpenStack Compute (Nova):
  New

Bug description:
  I think our usage of the updated_at field to determine whether a
  service is up or not is buggy.  Consider this scenario:

  1) nova-compute is happily running and is up/enabled on compute-0
  2) something causes nova-compute to stop (process crash, hardware fault, 
power failure, network isolation, etc.)
  3) a minute later, the nova-compute service is reported as down
  4) I run nova service-disable compute-0 nova-compute
  5) nova-compute is now reported as up for the next minute

  I wonder if it would make sense to have a separate last status
  timestamp database field that would only get updated when we get a
  service status update and not when we change any other fields.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1420848/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1419115] [NEW] IndexError adding host to availability zone

2015-02-06 Thread Chris Friesen
Public bug reported:

There appears to be a bug in the code dealing with adding a disabled
host to an aggregate that is exported as an availability zone.

I disabled the nova-compute service on a host and then tried to add it to
an aggregate that is exported as an availability zone. This resulted in
the following error.


   File /usr/lib64/python2.7/site-packages/oslo/utils/excutils.py, line 82, 
in __exit__
 six.reraise(self.type_, self.value, self.tb)
   File /usr/lib64/python2.7/site-packages/nova/exception.py, line 71, in 
wrapped
 return f(self, context, *args, **kw)
   File /usr/lib64/python2.7/site-packages/nova/compute/api.py, line 3673, in 
add_host_to_aggregate
 aggregate=aggregate)
   File /usr/lib64/python2.7/site-packages/nova/compute/api.py, line 3591, in 
is_safe_to_update_az
 host_az = host_azs.pop()
 IndexError: pop from empty list
 

The code looks like this:

    if 'availability_zone' in metadata:
        _hosts = hosts or aggregate.hosts
        zones, not_zones = availability_zones.get_availability_zones(
            context, with_hosts=True)
        for host in _hosts:
            # NOTE(sbauza): Host can only be in one AZ, so let's take only
            #               the first element
            host_azs = [az for (az, az_hosts) in zones
                        if host in az_hosts
                        and az != CONF.internal_service_availability_zone]
            host_az = host_azs.pop()

It appears that for a disabled host, host_azs can be empty, resulting in
an error when we try to pop() from it.
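
A defensive variant would look something like this (sketch only; what to do
with a host that currently belongs to no availability zone is a separate
policy question):

    host_azs = [az for (az, az_hosts) in zones
                if host in az_hosts
                and az != CONF.internal_service_availability_zone]
    host_az = host_azs.pop() if host_azs else None
    if host_az is None:
        continue   # disabled host is in no AZ yet, nothing to validate against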

It works fine if the service is enabled on the host, and it works fine
if the service is disabled and I try to add the host to an aggregate that
is not exported as an availability zone.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1419115

Title:
  IndexError adding host to availability zone

Status in OpenStack Compute (Nova):
  New

Bug description:
  There appears to be a bug in the code dealing with adding a disabled
  host to an aggregate that is exported as an availability zone.

  I disabled the nova-compute service on a host and then tried to add it
  to an aggregate that is exported as an availability zone. This resulted
  in the following error.

  
 File /usr/lib64/python2.7/site-packages/oslo/utils/excutils.py, line 82, 
in __exit__
   six.reraise(self.type_, self.value, self.tb)
 File /usr/lib64/python2.7/site-packages/nova/exception.py, line 71, in 
wrapped
   return f(self, context, *args, **kw)
 File /usr/lib64/python2.7/site-packages/nova/compute/api.py, line 3673, 
in add_host_to_aggregate
   aggregate=aggregate)
 File /usr/lib64/python2.7/site-packages/nova/compute/api.py, line 3591, 
in is_safe_to_update_az
   host_az = host_azs.pop()
   IndexError: pop from empty list
   

  The code looks like this:

      if 'availability_zone' in metadata:
          _hosts = hosts or aggregate.hosts
          zones, not_zones = availability_zones.get_availability_zones(
              context, with_hosts=True)
          for host in _hosts:
              # NOTE(sbauza): Host can only be in one AZ, so let's take only
              #               the first element
              host_azs = [az for (az, az_hosts) in zones
                          if host in az_hosts
                          and az != CONF.internal_service_availability_zone]
              host_az = host_azs.pop()

  It appears that for a disabled host, host_azs can be empty, resulting
  in an error when we try to pop() from it.

  It works fine if the service is enabled on the host, and it works fine
  if the service is disabled and I try to add the host to an aggregate
  that is not exported as an availability zone.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1419115/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1417667] [NEW] migration/evacuation/rebuild/resize of instance with dedicated cpus needs to recalculate cpus on destination

2015-02-03 Thread Chris Friesen
Public bug reported:

I'm running nova trunk, commit 752954a.

I configured a flavor with two vcpus and extra specs
hw:cpu_policy=dedicated in order to enable vcpu pinning.

I booted up a number of instances such that there was one instance
affined to host cpus 12 and 13 on compute-0, and another instance
affined to cpus 12 and 13 on compute-2.  (As reported by virsh vcpupin
and virsh dumpxml.)

I then triggered a live migration of one instance from compute-0 to
compute-2.  This resulted in both instances being affined to host cpus
12 and 13 on compute-2.

The hw:cpu_policy=dedicated extra spec is intended to provide
dedicated host cpus for the instance.  In order to provide this, on a
live migration (or cold migration, or rebuild, or evacuation, or resize,
etc.) nova needs to ensure that the instance is affined to host cpus
that are not currently being used by other instances.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1417667

Title:
  migration/evacuation/rebuild/resize of instance with dedicated cpus
  needs to recalculate cpus on destination

Status in OpenStack Compute (Nova):
  New

Bug description:
  I'm running nova trunk, commit 752954a.

  I configured a flavor with two vcpus and extra specs
  hw:cpu_policy=dedicated in order to enable vcpu pinning.

  I booted up a number of instances such that there was one instance
  affined to host cpus 12 and 13 on compute-0, and another instance
  affined to cpus 12 and 13 on compute-2.  (As reported by virsh
  vcpupin and virsh dumpxml.)

  I then triggered a live migration of one instance from compute-0 to
  compute-2.  This resulted in both instances being affined to host cpus
  12 and 13 on compute-2.

  The hw:cpu_policy=dedicated extra spec is intended to provide
  dedicated host cpus for the instance.  In order to provide this, on a
  live migration (or cold migration, or rebuild, or evacuation, or
  resize, etc.) nova needs to ensure that the instance is affined to
  host cpus that are not currently being used by other instances.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1417667/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1417671] [NEW] when using dedicated cpus, the emulator thread should be affined as well

2015-02-03 Thread Chris Friesen
Public bug reported:

I'm running nova trunk, commit 752954a.

I configured a flavor with two vcpus and extra specs
hw:cpu_policy=dedicated in order to enable vcpu pinning.

I booted up an instance with this flavor, and virsh dumpxml shows that
the two vcpus were affined suitably to host cpus, but the emulator
thread was left to float across the available host cores on that numa
node.

  <cputune>
    <shares>2048</shares>
    <vcpupin vcpu='0' policy='other' priority='0' cpuset='4'/>
    <vcpupin vcpu='1' policy='other' priority='0' cpuset='5'/>
    <emulatorpin cpuset='3-11'/>
  </cputune>


Looking at the kvm process shortly after creation, we see quite a few
emulator threads running with the emulatorpin affinity:

compute-2:~$ taskset -apc 136143
pid 136143's current affinity list: 3-11
pid 136144's current affinity list: 0,3-24,27-47
pid 136146's current affinity list: 4
pid 136147's current affinity list: 5
pid 136149's current affinity list: 0
pid 136433's current affinity list: 3-11
pid 136434's current affinity list: 3-11
pid 136435's current affinity list: 3-11
pid 136436's current affinity list: 3-11
pid 136437's current affinity list: 3-11
pid 136438's current affinity list: 3-11
pid 136439's current affinity list: 3-11
pid 136440's current affinity list: 3-11
pid 136441's current affinity list: 3-11
pid 136442's current affinity list: 3-11
pid 136443's current affinity list: 3-11
pid 136444's current affinity list: 3-11
pid 136445's current affinity list: 3-11
pid 136446's current affinity list: 3-11
pid 136447's current affinity list: 3-11
pid 136448's current affinity list: 3-11
pid 136449's current affinity list: 3-11
pid 136450's current affinity list: 3-11
pid 136451's current affinity list: 3-11
pid 136452's current affinity list: 3-11
pid 136453's current affinity list: 3-11
pid 136454's current affinity list: 3-11


Since the purpose of hw:cpu_policy=dedicated is to provide a dedicated host 
CPU for each guest CPU, the libvirt emulatorpin cpuset for a given guest should 
be set to one (or possibly more) of the CPUs specified for that guest.  
Otherwise, any work done by the emulator threads could rob CPU time from 
another guest instance.

Personally I'd like to see the emulator thread affined the same as guest
vCPU 0 (we use guest vCPU0 as a maintenance processor while doing the
real work on the other vCPUs), but an argument could be made that it
should be affined to the logical OR of all the guest vCPU cpusets.
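
For the pinning in the XML above (vCPU0 on host CPU 4, vCPU1 on host CPU 5)
the two options work out as follows (toy calculation, not nova code):

    vcpu_pins = {0: {4}, 1: {5}}

    emulator_like_vcpu0 = vcpu_pins[0]                 # {4}
    emulator_union = set().union(*vcpu_pins.values())  # {4, 5}
    print(emulator_like_vcpu0, emulator_union)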

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1417671

Title:
  when using dedicated cpus, the emulator thread should be affined as
  well

Status in OpenStack Compute (Nova):
  New

Bug description:
  I'm running nova trunk, commit 752954a.

  I configured a flavor with two vcpus and extra specs
  hw:cpu_policy=dedicated in order to enable vcpu pinning.

  I booted up an instance with this flavor, and virsh dumpxml shows
  that the two vcpus were affined suitably to host cpus, but the
  emulator thread was left to float across the available host cores on
  that numa node.

  <cputune>
    <shares>2048</shares>
    <vcpupin vcpu='0' policy='other' priority='0' cpuset='4'/>
    <vcpupin vcpu='1' policy='other' priority='0' cpuset='5'/>
    <emulatorpin cpuset='3-11'/>
  </cputune>


  Looking at the kvm process shortly after creation, we see quite a few
  emulator threads running with the emulatorpin affinity:

  compute-2:~$ taskset -apc 136143
  pid 136143's current affinity list: 3-11
  pid 136144's current affinity list: 0,3-24,27-47
  pid 136146's current affinity list: 4
  pid 136147's current affinity list: 5
  pid 136149's current affinity list: 0
  pid 136433's current affinity list: 3-11
  pid 136434's current affinity list: 3-11
  pid 136435's current affinity list: 3-11
  pid 136436's current affinity list: 3-11
  pid 136437's current affinity list: 3-11
  pid 136438's current affinity list: 3-11
  pid 136439's current affinity list: 3-11
  pid 136440's current affinity list: 3-11
  pid 136441's current affinity list: 3-11
  pid 136442's current affinity list: 3-11
  pid 136443's current affinity list: 3-11
  pid 136444's current affinity list: 3-11
  pid 136445's current affinity list: 3-11
  pid 136446's current affinity list: 3-11
  pid 136447's current affinity list: 3-11
  pid 136448's current affinity list: 3-11
  pid 136449's current affinity list: 3-11
  pid 136450's current affinity list: 3-11
  pid 136451's current affinity list: 3-11
  pid 136452's current affinity list: 3-11
  pid 136453's current affinity list: 3-11
  pid 136454's current affinity list: 3-11

  
  Since the purpose of hw:cpu_policy=dedicated is to provide a dedicated host 
CPU for each guest CPU, the libvirt emulatorpin cpuset for a given guest should 
be set to one (or 

[Yahoo-eng-team] [Bug 1417723] [NEW] when using dedicated cpus, the guest topology doesn't match the host

2015-02-03 Thread Chris Friesen
Public bug reported:

According to http://specs.openstack.org/openstack/nova-
specs/specs/juno/approved/virt-driver-cpu-pinning.html, the topology of
the guest is set up as follows:

In the absence of an explicit vCPU topology request, the virt drivers
typically expose all vCPUs as sockets with 1 core and 1 thread. When
strict CPU pinning is in effect the guest CPU topology will be setup to
match the topology of the CPUs to which it is pinned.

What I'm seeing is that when strict CPU pinning is in use the guest
seems to be configuring multiple threads, even if the host doesn't have
threading enabled.

As an example, I set up a flavor with 2 vCPUs and enabled dedicated
cpus.  I then booted up an instance of this flavor on two separate
compute nodes, one with hyperthreading enabled and one with
hyperthreading disabled.  In both cases, virsh dumpxml gave the
following topology:

<topology sockets='1' cores='1' threads='2'/>

When running on the system with hyperthreading disabled, this should
presumably have been set to cores=2 threads=1.

Taking this a bit further, even if hyperthreading is enabled on the host
it would be more accurate to only specify multiple threads in the guest
topology if the vCPUs are actually affined to multiple threads of the
same host core.  Otherwise it would be more accurate to specify the
guest topology with multiple cores of one thread each.
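
A toy sketch of that expectation (not nova code): report threads > 1 only
when the pinned host CPUs really are sibling hyperthreads, otherwise report
plain cores:

    def guest_topology(vcpus, pinned_to_sibling_threads):
        if pinned_to_sibling_threads:
            return {'sockets': 1, 'cores': vcpus // 2, 'threads': 2}
        return {'sockets': 1, 'cores': vcpus, 'threads': 1}

    print(guest_topology(2, pinned_to_sibling_threads=False))  # cores=2, threads=1
    print(guest_topology(2, pinned_to_sibling_threads=True))   # cores=1, threads=2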

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1417723

Title:
  when using dedicated cpus, the guest topology doesn't match the host

Status in OpenStack Compute (Nova):
  New

Bug description:
  According to http://specs.openstack.org/openstack/nova-
  specs/specs/juno/approved/virt-driver-cpu-pinning.html, the topology
  of the guest is set up as follows:

  In the absence of an explicit vCPU topology request, the virt drivers
  typically expose all vCPUs as sockets with 1 core and 1 thread. When
  strict CPU pinning is in effect the guest CPU topology will be setup
  to match the topology of the CPUs to which it is pinned.

  What I'm seeing is that when strict CPU pinning is in use the guest
  seems to be configuring multiple threads, even if the host doesn't
  have threading enabled.

  As an example, I set up a flavor with 2 vCPUs and enabled dedicated
  cpus.  I then booted up an instance of this flavor on two separate
  compute nodes, one with hyperthreading enabled and one with
  hyperthreading disabled.  In both cases, virsh dumpxml gave the
  following topology:

  <topology sockets='1' cores='1' threads='2'/>

  When running on the system with hyperthreading disabled, this should
  presumably have been set to cores=2 threads=1.

  Taking this a bit further, even if hyperthreading is enabled on the
  host it would be more accurate to only specify multiple threads in the
  guest topology if the vCPUs are actually affined to multiple threads
  of the same host core.  Otherwise it would be more accurate to specify
  the guest topology with multiple cores of one thread each.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1417723/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1417201] [NEW] nova-scheduler exception when trying to use hugepages

2015-02-02 Thread Chris Friesen
Public bug reported:

I'm trying to make use of huge pages as described in
http://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented
/virt-driver-large-pages.html.  I'm running nova kilo as of Jan 27th.
The other openstack services are juno.  Libvirt is 1.2.8.

I've allocated 1 2MB pages on a compute node.  virsh capabilities
on that node contains:

<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>67028244</memory>
      <pages unit='KiB' size='4'>16032069</pages>
      <pages unit='KiB' size='2048'>5000</pages>
      <pages unit='KiB' size='1048576'>1</pages>
...
    <cell id='1'>
      <memory unit='KiB'>67108864</memory>
      <pages unit='KiB' size='4'>16052224</pages>
      <pages unit='KiB' size='2048'>5000</pages>
      <pages unit='KiB' size='1048576'>1</pages>

I then restarted nova-compute, I set hw:mem_page_size=large on a
flavor, and then tried to boot up an instance with that flavor.  I got
the error logs below in nova-scheduler.  Is this a bug?
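
For reference, the reproduction boils down to something like the following
(the flavor name here is just an example):

  nova flavor-key m1.small set hw:mem_page_size=large
  nova boot --flavor m1.small --image cirros-0.3.1-x86_64-uec hugepage-test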

Feb  2 16:23:10 controller-0 nova-scheduler Exception during message handling: 
Cannot load 'mempages' in the base class
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher Traceback 
(most recent call last):
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, line 
134, in _dispatch_and_reply
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
incoming.message))
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, line 
177, in _dispatch
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher return 
self._do_dispatch(endpoint, method, ctxt, args)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py, line 
123, in _do_dispatch
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher result = 
getattr(endpoint, method)(ctxt, **new_args)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py, line 139, in 
inner
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher return 
func(*args, **kwargs)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/manager.py, line 86, in 
select_destinations
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
filter_properties)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py, line 
67, in select_destinations
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
filter_properties)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py, line 
138, in _schedule
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
filter_properties, index=num)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py, line 391, 
in get_filtered_hosts
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher hosts, 
filter_properties, index)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/filters.py, line 77, in 
get_filtered_objects
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher list_objs 
= list(objs)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/filters.py, line 43, in filter_all
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher if 
self._filter_one(obj, filter_properties):
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/filters/__init__.py, line 
27, in _filter_one
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher return 
self.host_passes(obj, filter_properties)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/scheduler/filters/numa_topology_filter.py,
 line 45, in host_passes
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
limits_topology=limits))
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/virt/hardware.py, line 1161, in 
numa_fit_instance_to_host
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher 
host_cell, instance_cell, limit_cell)
2015-02-02 16:23:10.746 37521 TRACE oslo.messaging.rpc.dispatcher   File 
/usr/lib64/python2.7/site-packages/nova/virt/hardware.py, line 851, in 
_numa_fit_instance_cell
2015-02-02 16:23:10.746 

[Yahoo-eng-team] [Bug 1410924] [NEW] instructions for rebuilding API samples are wrong

2015-01-14 Thread Chris Friesen
Public bug reported:

The instructions in nova/tests/functional/api_samples/README.rst say to
run GENERATE_SAMPLES=True tox -epy27 nova.tests.unit.integrated, but
that path doesn't exist anymore.

Running GENERATE_SAMPLES=True tox -e functional seems to work, but
someone who knows more than me should double-check that.

It looks like this was missed when a bunch of functional tests were
moved into nova/tests/functional.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1410924

Title:
  instructions for rebuilding API samples are wrong

Status in OpenStack Compute (Nova):
  New

Bug description:
  The instructions in nova/tests/functional/api_samples/README.rst say
  to run GENERATE_SAMPLES=True tox -epy27 nova.tests.unit.integrated,
  but that path doesn't exist anymore.

  Running GENERATE_SAMPLES=True tox -e functional seems to work, but
  someone who knows more than me should double-check that.

  It looks like this was missed when a bunch of functional tests were
  moved into nova/tests/functional.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1410924/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1368917] [NEW] rpc core should abort a call() early if the connection is terminated before the timeout period expires

2014-09-12 Thread Chris Friesen
Public bug reported:

As it stands, if a client issuing an RPC call() sends a message to the
rabbitmq server and the rabbitmq server then does a failover, the client
will wait for the full RPC timeout period (60 seconds) even though the new
rabbitmq server has come up long before then and some connections have
been reestablished.

The RPC core should notice that the server has gone away and should
notify any entities waiting for an RPC call() response so that they can
error out early rather than waiting for the full RPC timeout period.

This was detected on Havana, but it seems to apply to all other versions
as well.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: oslo

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1368917

Title:
  rpc core should abort a call() early if the connection is terminated
  before the timeout period expires

Status in OpenStack Compute (Nova):
  New

Bug description:
  As it stands, if a client issuing an RPC call() sends a message to the
  rabbitmq server and the rabbitmq server then does a failover, the client
  will wait for the full RPC timeout period (60 seconds) even though the new
  rabbitmq server has come up long before then and some connections have
  been reestablished.

  The RPC core should notice that the server has gone away and should
  notify any entities waiting for an RPC call() response so that they
  can error out early rather than waiting for the full RPC timeout
  period.

  This was detected on Havana, but it seems to apply to all other
  versions as well.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1368917/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1368989] [NEW] service_update() should not set an RPC timeout longer than service.report_interval

2014-09-12 Thread Chris Friesen
Public bug reported:

nova.servicegroup.drivers.db.DbDriver._report_state() is called every
service.report_interval seconds from a timer in order to periodically
report the service state.  It calls self.conductor_api.service_update().

If this ends up calling
nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC
call() to nova-conductor.

If anything happens to the RPC server (failover, switchover, etc.) by
default the RPC code will wait 60 seconds for a response (blocking the
timer-based calling of _report_state() in the meantime).  This is long
enough to cause the status in the database to get old enough that other
services consider this service to be down.

Arguably, since we're going to call service_update() again in
service.report_interval seconds, there's no reason to wait the full 60
seconds.  Instead, it would make sense to set the RPC timeout for the
service_update() call to something slightly less than
service.report_interval seconds.
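
As a rough sketch of the idea (written against an oslo.messaging-style client
for illustration; the Havana rpc code is structured differently and the
attribute names here are assumptions):

  # Cap the RPC timeout at just under the report interval so a lost reply
  # cannot block the next periodic state report.
  timeout = max(1, CONF.report_interval - 1)
  cctxt = self.client.prepare(timeout=timeout)
  cctxt.call(context, 'service_update', service_obj=service)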

I've also submitted a related bug report
(https://bugs.launchpad.net/bugs/1368917) to improve RPC loss of
connection in general, but I expect that'll take a while to deal with
while this particular case can be handled much more easily.

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1368989

Title:
  service_update() should not set an RPC timeout longer than
  service.report_interval

Status in OpenStack Compute (Nova):
  New

Bug description:
  nova.servicegroup.drivers.db.DbDriver._report_state() is called every
  service.report_interval seconds from a timer in order to periodically
  report the service state.  It calls
  self.conductor_api.service_update().

  If this ends up calling
  nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC
  call() to nova-conductor.

  If anything happens to the RPC server (failover, switchover, etc.) by
  default the RPC code will wait 60 seconds for a response (blocking the
  timer-based calling of _report_state() in the meantime).  This is long
  enough to cause the status in the database to get old enough that
  other services consider this service to be down.

  Arguably, since we're going to call service_update() again in
  service.report_interval seconds, there's no reason to wait the full 60
  seconds.  Instead, it would make sense to set the RPC timeout for the
  service_update() call to something slightly less than
  service.report_interval seconds.

  I've also submitted a related bug report
  (https://bugs.launchpad.net/bugs/1368917) to improve RPC loss of
  connection in general, but I expect that'll take a while to deal with
  while this particular case can be handled much more easily.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1368989/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1330744] [NEW] live migration is incorrectly comparing host cpu features

2014-06-16 Thread Chris Friesen
Public bug reported:

Running Havana, we're seeing live migration fail when attempting to
migrate from an Ivy-Bridge host to a Sandy-Bridge host.

However, we're using the default kvm guest config which has a safe
default virtual cpu with a subset of cpu features.  /proc/cpuinfo from
within the guest looks the same on both types of hosts.

I think the problem is that when check_can_live_migrate_destination()
calls _compare_cpu(), it's comparing the host CPUs.  Instead, I think we
should be comparing the guest CPU against the host CPU of the
destination to make sure it's compatible.  (Assuming that libvirt
considers the qemu virtual cpu to be compatible with the host cpu.)
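
A minimal sketch of the suggested comparison (the guest CPU XML below is
hypothetical; compareCPU() is the libvirt call that the existing comparison
is built on):

  import libvirt

  # XML describing the *guest* CPU model, not the source host's CPU.
  guest_cpu_xml = "<cpu mode='custom' match='exact'><model>qemu64</model></cpu>"

  # Connection to the destination host.
  conn = libvirt.open('qemu:///system')
  if conn.compareCPU(guest_cpu_xml, 0) == libvirt.VIR_CPU_COMPARE_INCOMPATIBLE:
      raise Exception('guest CPU is not compatible with the destination host')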

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1330744

Title:
  live migration is incorrectly comparing host cpu features

Status in OpenStack Compute (Nova):
  New

Bug description:
  Running Havana, we're seeing live migration fail when attempting to
  migrate from an Ivy-Bridge host to a Sandy-Bridge host.

  However, we're using the default kvm guest config which has a safe
  default virtual cpu with a subset of cpu features.  /proc/cpuinfo from
  within the guest looks the same on both types of hosts.

  I think the problem is that when check_can_live_migrate_destination()
  calls _compare_cpu(), it's comparing the host CPUs.  Instead, I think
  we should be comparing the guest CPU against the host CPU of the
  destination to make sure it's compatible.  (Assuming that libvirt
  considers the qemu virtual cpu to be compatible with the host cpu.)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1330744/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1313967] [NEW] build_and_run_instance() appears to be dead code

2014-04-28 Thread Chris Friesen
Public bug reported:

In nova/compute/manager.py, the code path

build_and_run_instance()
do_build_and_run_instance()
 _build_and_run_instance()

seems to be dead code, used by nothing but the unit tests.  It looks
like the code that is actually being used is run_instance().

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New

** Changed in: nova
 Assignee: (unassigned) = Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1313967

Title:
  build_and_run_instance() appears to be dead code

Status in OpenStack Compute (Nova):
  New

Bug description:
  In nova/compute/manager.py, the code path

  build_and_run_instance()
  do_build_and_run_instance()
   _build_and_run_instance()

  seems to be dead code, used by nothing but the unit tests.  It looks
  like the code that is actually being used is run_instance().

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1313967/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1313967] Re: build_and_run_instance() appears to be dead code

2014-04-28 Thread Chris Friesen
Sorry for the noise, I started reading the code and realized that it was
just taking a long time to transition over to the new function.

** Changed in: nova
   Status: New = Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1313967

Title:
  build_and_run_instance() appears to be dead code

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  In nova/compute/manager.py, the code path

  build_and_run_instance()
  do_build_and_run_instance()
   _build_and_run_instance()

  seems to be dead code, used by nothing but the unit tests.  It looks
  like the code that is actually being used is run_instance().

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1313967/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1311793] [NEW] wrap_instance_event() swallows return codes

2014-04-23 Thread Chris Friesen
Public bug reported:

 In compute/manager.py the function wrap_instance_event() just calls
function().

This means that if it's used to decorate a function that returns a
value, then the caller will never see the return code.
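
A minimal sketch of the fix (the event-recording details are elided; the point
is simply that the wrapper should return whatever function() returns):

  import functools

  def wrap_instance_event(function):
      @functools.wraps(function)
      def decorated_function(self, context, *args, **kwargs):
          # ... record the instance action/event around the call ...
          return function(self, context, *args, **kwargs)
      return decorated_function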

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: New

** Changed in: nova
 Assignee: (unassigned) = Chris Friesen (cbf123)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1311793

Title:
  wrap_instance_event() swallows return codes

Status in OpenStack Compute (Nova):
  New

Bug description:
   In compute/manager.py the function wrap_instance_event() just calls
  function().

  This means that if it's used to decorate a function that returns a
  value, then the caller will never see the return code.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1311793/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1298494] [NEW] nova server-group-list doesn't show members of the group

2014-03-27 Thread Chris Friesen
Public bug reported:

With current devstack I ensured I had GroupAntiAffinityFilter in
scheduler_default_filters in /etc/nova/nova.conf, restarted nova-
scheduler, then ran:


nova server-group-create --policy anti-affinity antiaffinitygroup


nova server-group-list
+--------------------------------------+-------------------+--------------------+---------+----------+
| Id                                   | Name              | Policies           | Members | Metadata |
+--------------------------------------+-------------------+--------------------+---------+----------+
| 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
+--------------------------------------+-------------------+--------------------+---------+----------+


nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=5d639349-1b77-43df-b13f-ed586e73b3ac cirros0

nova list
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| ID                                   | Name    | Status | Task State | Power State | Networks           |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| a7a3ec40-85d9-4b72-a522-d1c0684f3ada | cirros0 | ACTIVE | -          | Running     | private=10.4.128.2 |
+--------------------------------------+---------+--------+------------+-------------+--------------------+


Then I tried listing the groups, and it didn't print the newly-booted
instance as a member:

nova server-group-list
+--------------------------------------+-------------------+--------------------+---------+----------+
| Id                                   | Name              | Policies           | Members | Metadata |
+--------------------------------------+-------------------+--------------------+---------+----------+
| 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
+--------------------------------------+-------------------+--------------------+---------+----------+


Rerunning the nova command with --debug we see that the problem is in nova, not 
novaclient:

RESP BODY: {"server_groups": [{"members": [], "metadata": {}, "id":
"5d639349-1b77-43df-b13f-ed586e73b3ac", "policies": ["anti-affinity"],
"name": "antiaffinitygroup"}]}


Looking at the database, we see that the instance is actually tracked as a 
member of the list (along with two other instances that haven't been marked as 
deleted yet, which is also a bug I think).

mysql> select * from instance_group_member;
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
| created_at          | updated_at | deleted_at | deleted | id | instance_id                          | group_id |
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
| 2014-03-26 20:19:14 | NULL       | NULL       |       0 |  1 | d289502b-57fc-46f6-b39d-66a1db3a9ebc |        1 |
| 2014-03-26 20:25:04 | NULL       | NULL       |       0 |  2 | e07f1f15-4e93-4845-9203-bf928c196a78 |        1 |
| 2014-03-26 20:35:11 | NULL       | NULL       |       0 |  3 | a7a3ec40-85d9-4b72-a522-d1c0684f3ada |        1 |
+---------------------+------------+------------+---------+----+--------------------------------------+----------+
3 rows in set (0.00 sec)

** Affects: nova
 Importance: Undecided
 Status: New

** Summary changed:

- nova instance-group-list doesn't show members of the group
+ nova server-group-list doesn't show members of the group

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1298494

Title:
  nova server-group-list doesn't show members of the group

Status in OpenStack Compute (Nova):
  New

Bug description:
  With current devstack I ensured I had GroupAntiAffinityFilter in
  scheduler_default_filters in /etc/nova/nova.conf, restarted nova-
  scheduler, then ran:

  
  nova server-group-create --policy anti-affinity antiaffinitygroup

  
  nova server-group-list
  +--------------------------------------+-------------------+--------------------+---------+----------+
  | Id                                   | Name              | Policies           | Members | Metadata |
  +--------------------------------------+-------------------+--------------------+---------+----------+
  | 5d639349-1b77-43df-b13f-ed586e73b3ac | antiaffinitygroup | [u'anti-affinity'] | []      | {}       |
  +--------------------------------------+-------------------+--------------------+---------+----------+

  
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=5d639349-1b77-43df-b13f-ed586e73b3ac cirros0

  nova list
  
+--+-+++-++
  | ID   

[Yahoo-eng-team] [Bug 1298509] [NEW] nova server-group-delete allows deleting server group with members

2014-03-27 Thread Chris Friesen
Public bug reported:

Currently nova will let you do this:

nova server-group-create --policy anti-affinity antiaffinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid 
cirros0
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid 
cirros1
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid 
cirros2
nova server-group-delete group_uuid

Given that a server group is designed to logically group servers
together, I don't think it makes sense to allow nova to delete a server
group that currently has undeleted members in it.
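
A sketch of the kind of guard I have in mind for the API extension (not actual
nova code; the exception type and message are illustrative):

  from webob import exc

  # Refuse to delete a server group that still has undeleted members.
  if group.members:
      msg = 'Server group %s still has members' % group.uuid
      raise exc.HTTPBadRequest(explanation=msg)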

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute

** Tags added: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1298509

Title:
  nova server-group-delete allows deleting server group with members

Status in OpenStack Compute (Nova):
  New

Bug description:
  Currently nova will let you do this:

  nova server-group-create --policy anti-affinity antiaffinitygroup
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros0
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros1
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros2
  nova server-group-delete group_uuid

  Given that a server group is designed to logically group servers
  together, I don't think it makes sense to allow nova to delete a
  server group that currently has undeleted members in it.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1298509/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1298513] [NEW] nova server group policy should be applied when resizing/migrating server

2014-03-27 Thread Chris Friesen
Public bug reported:

If I do the following:

nova server-group-create --policy affinity affinitygroup
nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint group=group_uuid 
cirros0
nova resize cirros0 2

The cirros0 server will be resized but when the scheduler runs it
doesn't take into account the scheduling policy of the server group.

I think the same would be true if we migrate the server.

Lastly, when doing migration/evacuation and the user has specified the
compute node we might want to validate the choice against the group
policy.  For emergencies we might want to allow policy violation with a
--force option or something.

** Affects: nova
 Importance: Undecided
 Status: New

** Description changed:

- If we try to resize/migrate a server that is part of a server group, the
- server group policy should be applied when scheduling the server.
+ If I do the following:
+ 
+ nova server-group-create --policy affinity affinitygroup
+ nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros0
+ nova resize cirros0 2
+ 
+ The cirros0 server will be resized but when the scheduler runs it
+ doesn't take into account the scheduling policy of the server group.
+ 
+ I think the same would be true if we migrate the server.
+ 
+ Lastly, when doing migration/evacuation and the user has specified the
+ compute node we might want to validate the choice against the group
+ policy.  For emergencies we might want to allow policy violation with a
+ --force option or something.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1298513

Title:
  nova server group policy should be applied when resizing/migrating
  server

Status in OpenStack Compute (Nova):
  New

Bug description:
  If I do the following:

  nova server-group-create --policy affinity affinitygroup
  nova boot --flavor=1 --image=cirros-0.3.1-x86_64-uec --hint 
group=group_uuid cirros0
  nova resize cirros0 2

  The cirros0 server will be resized but when the scheduler runs it
  doesn't take into account the scheduling policy of the server group.

  I think the same would be true if we migrate the server.

  Lastly, when doing migration/evacuation and the user has specified the
  compute node we might want to validate the choice against the group
  policy.  For emergencies we might want to allow policy violation with
  a --force option or something.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1298513/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1298690] [NEW] sqlite regexp() function doesn't behave like mysql

2014-03-27 Thread Chris Friesen
Public bug reported:

In bug 1298494 I recently saw a case where the unit tests (using sqlite)
behaved differently than devstack with mysql.

The issue seems to be when we do

filters = {'uuid': group.members, 'deleted_at': None}
instances = instance_obj.InstanceList.get_by_filters(
context, filters=filters)


Eventually down in db/sqlalchemy/api.py we end up calling

query = query.filter(column_attr.op(db_regexp_op)(
 str(filters[filter_name])))

where str(filters[filter_name]) is the string 'None'.

When using mysql, a regexp comparison of the string 'None' against a
NULL field fails to match.

Since sqlite doesn't have its own regexp function we provide one in
openstack/common/db/sqlalchemy/session.py.  In the buggy case we end up
calling it as regexp('None', None), where the types are unicode and
NoneType.  However, we end up converting the second arg to text type
before calling reg.search() on it, so it matches.

This is a bug, we want the unit tests to behave like the real system.
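
A sketch of what a MySQL-compatible replacement would look like (illustration
of the desired behaviour, not the current oslo code):

  import re

  def regexp(expr, item):
      # Mimic MySQL: a REGEXP comparison against a NULL column never matches,
      # so don't coerce None into the text 'None' before matching.
      if item is None:
          return False
      return re.search(str(expr), str(item)) is not None

  # Registered on the sqlite connection with:
  #   connection.create_function('regexp', 2, regexp)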

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: compute db

** Tags added: compute db

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1298690

Title:
  sqlite regexp() function doesn't behave like mysql

Status in OpenStack Compute (Nova):
  New

Bug description:
  In bug 1298494 I recently saw a case where the unit tests (using
  sqlite) behaved differently than devstack with mysql.

  The issue seems to be when we do

  filters = {'uuid': group.members, 'deleted_at': None}
  instances = instance_obj.InstanceList.get_by_filters(
  context, filters=filters)

  
  Eventually down in db/sqlalchemy/api.py we end up calling

  query = query.filter(column_attr.op(db_regexp_op)(
   str(filters[filter_name])))

  where str(filters[filter_name]) is the string 'None'.

  When using mysql, a regexp comparison of the string 'None' against a
  NULL field fails to match.

  Since sqlite doesn't have its own regexp function we provide one in
  openstack/common/db/sqlalchemy/session.py.  In the buggy case we end
  up calling it as regexp('None', None), where the types are unicode
  and NoneType.  However, we end up converting the second arg to text
  type before calling reg.search() on it, so it matches.

  This is a bug, we want the unit tests to behave like the real system.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1298690/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1296967] [NEW] instances stuck with task_state of REBOOTING after controller switchover

2014-03-24 Thread Chris Friesen
Public bug reported:


We were doing some testing of Havana and have run into a scenario that ended up 
with two instances stuck with a task_state of REBOOTING following a reboot of 
the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an 
instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls 
instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between 
the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING. 

Currently when doing a reboot we set the REBOOTING task_state in the
database in compute-api and then send an RPC cast. That seems awfully
risky given that if that message gets lost or the call fails for any
reason we could end up stuck in the REBOOTING state forever.  I think it
might make sense to have the power state audit clear the REBOOTING state
if appropriate, but others with more experience should make that call.
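
A sketch of what that audit-based cleanup might look like (illustration only,
not actual nova code):

  # In the periodic power-state sync, clear a stale REBOOTING task_state once
  # the hypervisor reports the guest as running again.
  if (db_instance.task_state == task_states.REBOOTING
          and vm_power_state == power_state.RUNNING):
      db_instance.task_state = None
      db_instance.save()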


It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the new 
power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but 
nothing would update the task_state.  


The timeline that I have looks like this.  We had some buggy code that
sent all the instances for a reboot when the controller came up.  The
first two are in the controller logs below, and these are the ones that
failed.

controller: (running everything but nova-compute)
nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO 
nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO 
nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action
 HTTP/1.1 status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO 
nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO 
nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action
 HTTP/1.1 status: 202 len: 185 time: 0.1583798

After this there are other reboot requests for the other instances, and
those ones passed.


Interestingly, we later see this
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO 
nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO 
nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action
 HTTP/1.1 status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO 
nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO 
nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action
 HTTP/1.1 status: 409 len: 303 time: 0.0366399


Presumably the 409 responses are because nova thinks that these instances are 
currently rebooting.


compute:
2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] 
Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO 

[Yahoo-eng-team] [Bug 1296972] Re: RPC code in Havana doesn't handle connection errors

2014-03-24 Thread Chris Friesen
Looks like I misread that patch below, it's adding back the channel
error check, not the connection error check.

This may be due to a bad patch on our end, sorry for the noise.

** Changed in: nova
   Status: New = Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1296972

Title:
  RPC code in Havana doesn't handle connection errors

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  We've got an HA controller setup using pacemaker and were stress-
  testing it by doing multiple controlled switchovers while doing other
  activity.  Generally this works okay, but last night we ran into a
  problem where nova-compute got into a state where it was unable to
  reconnect with the AMQP server.  Logs are at the bottom, they repeat
  every minute and did this for 7+ hours until the system was manually
  cleaned up.

  
  I've found something in the code that looks a bit suspicious.  The
  "Unexpected exception occurred 61 time(s)... retrying." message comes from
  forever_retry_uncaught_exceptions() in excutils.py.  It looks like we're raising

  RecoverableConnectionError: connection already closed

  down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py,
  but nothing handles it.

  It looks like the most likely place that should be handling it is
  nova.openstack.common.rpc.impl_kombu.Connection.ensure().

  In the current oslo.messaging code the ensure() routine explicitly
  handles connection errors (which RecoverableConnectionError is) and
  socket timeouts--the ensure() routine in Havana doesn't do this.

  Maybe we should look at porting
  
https://github.com/openstack/oslo.messaging/commit/0400cbf4f83cf8d58076c7e65e08a156ec3508a8
  to the Havana RPC code?


  Logs showing the start of the problem and the first few iterations of
  the repeating issue:

  
  2014-03-24 09:24:33.566 6620 AUDIT nova.compute.resource_tracker [-] Auditing 
locally available compute resources
  2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'sgw-4', vm_state=u'active', task_state=None, vcpus=2, 
cpuset=0x180, cpulist=[7, 8] pinned, nodelist=[0], node=0 
  2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'sgw-1', vm_state=u'active', task_state=None, vcpus=2, 
cpuset=0x60, cpulist=[5, 6] pinned, nodelist=[0], node=0 
  2014-03-24 09:24:34.126 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'load_balancer', vm_state=u'active', task_state=None, vcpus=3, 
cpuset=0x1c00, cpulist=[10, 11, 12] pinned, nodelist=[1], node=1 
  2014-03-24 09:24:34.182 6620 AUDIT nova.compute.resource_tracker [-] Free ram 
(MB): 111290, per-node: [52286, 59304], numa nodes:2
  2014-03-24 09:24:34.183 6620 AUDIT nova.compute.resource_tracker [-] Free 
disk (GB): 29
  2014-03-24 09:24:34.183 6620 AUDIT nova.compute.resource_tracker [-] Free 
vcpus: 170, free per-node float vcpus: [48, 112], free per-node pinned vcpus: 
[3, 7]
  2014-03-24 09:24:34.183 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
vcpus:20, Free vcpus:170, 16.0x overcommit, per-cpu float cpulist: [3, 4, 9, 
13, 14, 15, 16, 17, 18, 19]
  2014-03-24 09:24:34.244 6620 INFO nova.compute.resource_tracker [-] 
Compute_service record updated for compute-0:compute-0
  2014-03-24 09:25:36.564 6620 AUDIT nova.compute.resource_tracker [-] Auditing 
locally available compute resources
  2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'sgw-4', vm_state=u'active', task_state=None, vcpus=2, 
cpuset=0x180, cpulist=[7, 8] pinned, nodelist=[0], node=0 
  2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'sgw-1', vm_state=u'active', task_state=None, vcpus=2, 
cpuset=0x60, cpulist=[5, 6] pinned, nodelist=[0], node=0 
  2014-03-24 09:25:37.122 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
instance: name=u'load_balancer', vm_state=u'active', task_state=None, vcpus=3, 
cpuset=0x1c00, cpulist=[10, 11, 12] pinned, nodelist=[1], node=1 
  2014-03-24 09:25:37.182 6620 AUDIT nova.compute.resource_tracker [-] Free ram 
(MB): 111290, per-node: [52286, 59304], numa nodes:2
  2014-03-24 09:25:37.182 6620 AUDIT nova.compute.resource_tracker [-] Free 
disk (GB): 29
  2014-03-24 09:25:37.183 6620 AUDIT nova.compute.resource_tracker [-] Free 
vcpus: 170, free per-node float vcpus: [48, 112], free per-node pinned vcpus: 
[3, 7]
  2014-03-24 09:25:37.183 6620 INFO nova.compute.resource_tracker [-] DETAIL: 
vcpus:20, Free vcpus:170, 16.0x overcommit, per-cpu float cpulist: [3, 4, 9, 
13, 14, 15, 16, 17, 18, 19]
  2014-03-24 09:25:37.245 6620 INFO nova.compute.resource_tracker [-] 
Compute_service record updated for compute-0:compute-0
  2014-03-24 09:26:47.324 6620 ERROR root [-] Unexpected exception occurred 1 
time(s)... retrying.
  2014-03-24 09:26:47.324 

[Yahoo-eng-team] [Bug 1294756] [NEW] missing test for None in sqlalchemy query filter

2014-03-19 Thread Chris Friesen
Public bug reported:

In db.sqlalchemy.api.instance_get_all_by_filters() there is code that
looks like this:

if not filters.pop('soft_deleted', False):
query_prefix = query_prefix.\
filter(models.Instance.vm_state != vm_states.SOFT_DELETED)


In sqlalchemy a comparison against a non-null value will not match null values, 
so the above filter will not return objects where vm_state is NULL.

The problem is that in the Instance object the vm_state field is
declared as nullable.  In many cases vm_state will in fact have a
value, but in get_test_instance() in test/utils.py the value of
vm_state is not specified.

Given the above, it seems that either we need to configure
models.Instance.vm_state as not nullable (and deal with the fallout),
or else we need to update instance_get_all_by_filters() to explicitly
check for None--something like this perhaps:

if not filters.pop('soft_deleted', False):
query_prefix = query_prefix.\
filter(or_(models.Instance.vm_state != vm_states.SOFT_DELETED,
   models.Instance.vm_state == None))

If we want to fix the query, I'll happily submit the updated code.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1294756

Title:
  missing test for None in sqlalchemy query filter

Status in OpenStack Compute (Nova):
  New

Bug description:
  In db.sqlalchemy.api.instance_get_all_by_filters() there is code that
  looks like this:

  if not filters.pop('soft_deleted', False):
  query_prefix = query_prefix.\
  filter(models.Instance.vm_state != vm_states.SOFT_DELETED)

  
  In sqlalchemy a comparison against a non-null value will not match null 
values, so the above filter will not return objects where vm_state is NULL.

  The problem is that in the Instance object the vm_state field is
  declared as nullable.  In many cases vm_state will in fact have a
  value, but in get_test_instance() in test/utils.py the value of
  vm_state is not specified.

  Given the above, it seems that either we need to configure
  models.Instance.vm_state as not nullable (and deal with the
  fallout), or else we need to update instance_get_all_by_filters() to
  explicitly check for None--something like this perhaps:

  if not filters.pop('soft_deleted', False):
  query_prefix = query_prefix.\
  filter(or_(models.Instance.vm_state != vm_states.SOFT_DELETED,
 models.Instance.vm_state == None))

  If we want to fix the query, I'll happily submit the updated code.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1294756/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1292963] [NEW] postgres incompatibility in InstanceGroup.get_hosts()

2014-03-15 Thread Chris Friesen
Public bug reported:

When running InstanceGroup.get_hosts() on a havana installation that
uses postgres I get the following error:


RemoteError: Remote error: ProgrammingError (ProgrammingError) operator does 
not exist: timestamp without time zone ~ unknown
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f] LINE 3: uuid != instances.uuid AND 
(instances.deleted_at ~ 'None') ...
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f]^
2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f] HINT:  No operator matches the given name 
and argument type(s). You might need to add explicit type casts.


I'm not a database expert, but after doing some digging, it seems that the 
problem is this line in get_hosts():

filters = {'uuid': filter_uuids, 'deleted_at': None}

It seems that current postgres doesn't allow implicit casts.  If I
change the line to:

filters = {'uuid': filter_uuids, 'deleted': 0}


Then it seems to work.

** Affects: nova
 Importance: Undecided
 Assignee: Chris Friesen (cbf123)
 Status: In Progress

** Changed in: nova
 Assignee: (unassigned) = Chris Friesen (cbf123)

** Changed in: nova
   Status: New = In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1292963

Title:
  postgres incompatibility in InstanceGroup.get_hosts()

Status in OpenStack Compute (Nova):
  In Progress

Bug description:
  When running InstanceGroup.get_hosts() on a havana installation that
  uses postgres I get the following error:

  
  RemoteError: Remote error: ProgrammingError (ProgrammingError) operator does 
not exist: timestamp without time zone ~ unknown
  2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f] LINE 3: uuid != instances.uuid AND 
(instances.deleted_at ~ 'None') ...
  2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f]^
  2014-03-14 09:58:57.193 8164 TRACE nova.compute.manager [instance: 
83439206-3a88-495b-b6c7-6aea1287109f] HINT:  No operator matches the given name 
and argument type(s). You might need to add explicit type casts.

  
  I'm not a database expert, but after doing some digging, it seems that the 
problem is this line in get_hosts():

  filters = {'uuid': filter_uuids, 'deleted_at': None}

  It seems that current postgres doesn't allow implicit casts.  If I
  change the line to:

  filters = {'uuid': filter_uuids, 'deleted': 0}

  
  Then it seems to work.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1292963/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1289064] [NEW] live migration of instance should claim resources on target compute node

2014-03-06 Thread Chris Friesen
Public bug reported:

I'm looking at the current Icehouse code, but this applies to previous
versions as well.

When we create a new instance via _build_instance() or
_build_and_run_instance(), in both cases we call instance_claim() to
test for resources and reserve them.

During a cold migration we call prep_resize() which calls resize_claim()
to reserve resources.

However, when we live-migrate or evacuate an instance we don't do this.
As far as I can see the current code will just spawn the new instance
but the resource usage won't be updated until the audit runs at some
unknown time in the future at which point it will add the new instance
to self.tracked_instances and update the resource usage.

This means that until the audit runs the scheduler has a stale view of
system resources.
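
As a sketch of the idea (modelled on the resource tracker's existing
instance_claim() used when booting an instance; this is not the actual fix and
the details are assumptions):

  rt = self._get_resource_tracker(node)
  # Take a claim on the destination before spawning the incoming instance so
  # the scheduler's view of free resources is updated immediately.
  with rt.instance_claim(context, instance, limits=None):
      # set up networking/volumes and spawn the migrated instance here
      ...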

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1289064

Title:
  live migration of instance should claim resources on target compute
  node

Status in OpenStack Compute (Nova):
  New

Bug description:
  I'm looking at the current Icehouse code, but this applies to previous
  versions as well.

  When we create a new instance via _build_instance() or
  _build_and_run_instance(), in both cases we call instance_claim() to
  test for resources and reserve them.

  During a cold migration we call prep_resize() which calls
  resize_claim() to reserve resources.

  However, when we live-migrate or evacuate an instance we don't do
  this.  As far as I can see the current code will just spawn the new
  instance but the resource usage won't be updated until the audit runs
  at some unknown time in the future at which point it will add the new
  instance to self.tracked_instances and update the resource usage.

  This means that until the audit runs the scheduler has a stale view of
  system resources.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1289064/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp