Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests
Hi all,

I'll disable the failing shelve/unshelve tests in the Intel NFV CI for the time being, until we have clarity on the problem. Stephen Finucane will be working on this in the next week or so.

>-----Original Message-----
>From: Chris Friesen [mailto:chris.frie...@windriver.com]
>Sent: Wednesday, May 25, 2016 5:40 PM
>To: openstack-dev@lists.openstack.org
>Subject: Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve
>tests
>
>On 05/22/2016 05:41 PM, Jay Pipes wrote:
>> Hello Novaites,
>>
>> I've noticed that the Intel NFV CI has been failing all test runs for
>> quite some time (at least a few days), always failing the same tests
>> around shelve/unshelve operations.
>
>
>
>> I looked through the conductor and compute logs to see if I could find
>> any possible reasons for the errors and found a number of the
>> following errors in the compute logs:
>>
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
>> "/opt/stack/new/nova/nova/compute/manager.py", line 4230, in
>> _unshelve_instance
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52]     with rt.instance_claim(context,
>> instance, limits):
>
>
>
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52]     newcell.unpin_cpus(pinned_cpus)
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
>> "/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52]     pinned=list(self.pinned_cpus))
>> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
>> cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot
>> pin/unpin cpus [6] from the following pinned set [0, 2, 4]
>>
>> on or around the time of the failures in Tempest.
>>
>> Perhaps tomorrow morning we can look into handling the above exception
>> properly from the compute manager, since clearly we shouldn't be
>> allowing CPUPinningInvalid to be raised in the resource tracker's
>> _update_usage() call.
>
>First, it seems wrong to me that an _unshelve_instance() call would result in
>unpinning any CPUs. If the instance was using pinned CPUs, then I would
>expect the CPUs to be unpinned when doing the "shelve" operation. When
>we do an instance claim as part of the "unshelve" operation, we should be
>pinning CPUs, not unpinning them.
>
>Second, the reason why CPUPinningInvalid gets raised in _update_usage() is
>that it has discovered an inconsistency in its view of resources. In this case,
>it's trying to unpin CPU 6 from a set of pinned CPUs that doesn't include
>CPU 6. I think this is a valid concern and should result in an error log.
>Whether it should cause the unshelve operation to fail is a separate
>question, but it's definitely a symptom that something is wrong with
>resource tracking on this compute node.
>
>Chris
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests
On 05/22/2016 05:41 PM, Jay Pipes wrote:
> Hello Novaites,
>
> I've noticed that the Intel NFV CI has been failing all test runs for
> quite some time (at least a few days), always failing the same tests
> around shelve/unshelve operations.
>
> I looked through the conductor and compute logs to see if I could find
> any possible reasons for the errors and found a number of the following
> errors in the compute logs:
>
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
> "/opt/stack/new/nova/nova/compute/manager.py", line 4230, in
> _unshelve_instance
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52]     with rt.instance_claim(context,
> instance, limits):
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52]     newcell.unpin_cpus(pinned_cpus)
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
> "/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52]     pinned=list(self.pinned_cpus))
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot pin/unpin
> cpus [6] from the following pinned set [0, 2, 4]
>
> on or around the time of the failures in Tempest.
>
> Perhaps tomorrow morning we can look into handling the above exception
> properly from the compute manager, since clearly we shouldn't be allowing
> CPUPinningInvalid to be raised in the resource tracker's _update_usage()
> call.

First, it seems wrong to me that an _unshelve_instance() call would result in unpinning any CPUs. If the instance was using pinned CPUs, then I would expect the CPUs to be unpinned when doing the "shelve" operation. When we do an instance claim as part of the "unshelve" operation, we should be pinning CPUs, not unpinning them.

Second, the reason why CPUPinningInvalid gets raised in _update_usage() is that it has discovered an inconsistency in its view of resources. In this case, it's trying to unpin CPU 6 from a set of pinned CPUs that doesn't include CPU 6. I think this is a valid concern and should result in an error log. Whether it should cause the unshelve operation to fail is a separate question, but it's definitely a symptom that something is wrong with resource tracking on this compute node.

Chris
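The inconsistency Chris describes can be illustrated in isolation. The sketch below is a simplified, hypothetical stand-in for Nova's NUMA cell bookkeeping (the real implementation lives in nova/objects/numa.py; the class layout here is an assumption for demonstration), showing the consistency check implied by the traceback: unpinning a CPU that is not in the cell's pinned set raises CPUPinningInvalid.

```python
# Illustrative sketch only -- not Nova's actual NUMACell implementation.


class CPUPinningInvalid(Exception):
    pass


class NUMACell(object):
    def __init__(self, pinned_cpus):
        self.pinned_cpus = set(pinned_cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        # Unpinning a CPU the tracker does not consider pinned means its
        # view of this cell is inconsistent -- exactly the failure in the
        # logs above.
        if not cpus.issubset(self.pinned_cpus):
            raise CPUPinningInvalid(
                "Cannot pin/unpin cpus %s from the following pinned set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


cell = NUMACell(pinned_cpus=[0, 2, 4])
try:
    cell.unpin_cpus([6])  # reproduces the failure mode from the logs
except CPUPinningInvalid as exc:
    print(exc)  # Cannot pin/unpin cpus [6] from the following pinned set [0, 2, 4]
```

This makes concrete why the exception is a symptom rather than a cause: by the time unpin_cpus() fires, the tracker's pinned set has already diverged from reality.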
[openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests
Hello Novaites,

I've noticed that the Intel NFV CI has been failing all test runs for quite some time (at least a few days), always failing the same tests around shelve/unshelve operations. The shelve/unshelve Tempest tests always result in a timeout exception being raised, looking similar to the following, from [1]:

2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base Traceback (most recent call last):
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base   File "tempest/api/compute/base.py", line 166, in server_check_teardown
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base     cls.server_id, 'ACTIVE')
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base   File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base     raise exceptions.TimeoutException(message)
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base TimeoutException: Request timed out
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base Details: (ServerActionsTestJSON:tearDown) Server cae6fd47-0968-4922-a03e-3f2872e4eb52 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: None.
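The timeout above comes out of Tempest's status waiter: the test polls the server until it reaches ACTIVE and gives up after the budget is spent. As an illustration only, the following is a simplified sketch of that polling pattern; the injected get_status/clock/sleep parameters are assumptions added for testability, not the real signature of tempest.common.waiters.

```python
import time


class TimeoutException(Exception):
    pass


def wait_for_server_status(get_status, desired_status, timeout=196,
                           interval=1, clock=time.time, sleep=time.sleep):
    # Poll the server's status until it matches, or raise once the
    # timeout budget is spent (mirroring the "failed to reach ACTIVE
    # status ... within the required time (196 s)" message above).
    start = clock()
    status = get_status()
    while status != desired_status:
        if clock() - start >= timeout:
            raise TimeoutException(
                "Server failed to reach %s status within the required "
                "time (%s s). Current status: %s."
                % (desired_status, timeout, status))
        sleep(interval)
        status = get_status()
    return status
```

The point for this bug report: the waiter only observes that the server is still SHELVED_OFFLOADED; the actual failure happened earlier, on the compute node, as the next traceback shows.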
I looked through the conductor and compute logs to see if I could find any possible reasons for the errors and found a number of the following errors in the compute logs:

2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/compute/manager.py", line 4230, in _unshelve_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     with rt.instance_claim(context, instance, limits):
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     return f(*args, **kwargs)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 151, in instance_claim
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     self._update_usage_from_instance(context, instance_ref)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 827, in _update_usage_from_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     self._update_usage(instance, sign=sign)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 666, in _update_usage
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     self.compute_node, usage, free)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/virt/hardware.py", line 1482, in get_host_numa_usage_from_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     host_numa_topology, instance_numa_topology, free=free))
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/virt/hardware.py", line 1348, in numa_usage_from_instances
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     newcell.unpin_cpus(pinned_cpus)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]   File "/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52]     pinned=list(self.pinned_cpus))
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot pin/unpin cpus [6] from the following pinned set [0, 2, 4]

on or around the time of the failures in Tempest.

Perhaps tomorrow morning we can look into handling the above exception properly from the compute manager, since clearly we shouldn't be allowing CPUPinningInvalid to be raised in the resource tracker's _update_usage() call.

Anyway, see you on IRC tomorrow morning and let's try to fix this.

Best,
-jay

[1] http://intel-openstack-ci-logs.
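One possible shape for the handling Jay suggests: catch CPUPinningInvalid at the claim boundary during unshelve instead of letting it escape as an unhandled traceback. This is a hedged sketch only; the names unshelve_with_claim and UnshelveError are hypothetical and this is not Nova's actual fix.

```python
# Hypothetical sketch -- not Nova's actual code or eventual fix.
import logging

LOG = logging.getLogger(__name__)


class CPUPinningInvalid(Exception):
    pass


class UnshelveError(Exception):
    pass


def unshelve_with_claim(resource_tracker, context, instance, limits):
    # Take the resource claim for the unshelved instance; if the claim
    # trips over inconsistent pinned-CPU bookkeeping, log it and fail
    # the operation cleanly instead of leaking the raw traceback.
    try:
        with resource_tracker.instance_claim(context, instance, limits):
            pass  # spawn the instance on this host (elided)
    except CPUPinningInvalid:
        LOG.exception("CPU pinning inconsistency while unshelving %s",
                      instance)
        raise UnshelveError(
            "unshelve failed for instance %s: inconsistent CPU pinning"
            % instance)
```

As Chris notes above, converting the exception into an error log and a clean failure addresses the symptom; the underlying tracking inconsistency between shelve and unshelve still needs its own fix.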