Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests

2016-05-26 Thread Znoinski, Waldemar
Hi all,

I'll disable the failing shelve/unshelve tests in the Intel NFV CI for the time 
being until we have clarity on the problem.
Stephen Finucane will be working on that in the next week or so.

 >-----Original Message-----
 >From: Chris Friesen [mailto:chris.frie...@windriver.com]
 >Sent: Wednesday, May 25, 2016 5:40 PM
 >To: openstack-dev@lists.openstack.org
 >Subject: Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve
 >tests
 >
 >On 05/22/2016 05:41 PM, Jay Pipes wrote:
 >> Hello Novaites,
 >>
 >> I've noticed that the Intel NFV CI has been failing all test runs for
 >> quite some time (at least a few days), always failing the same tests
 >> around shelve/unshelve operations.
 >
 >
 >
 >> I looked through the conductor and compute logs to see if I could find
 >> any possible reasons for the errors and found a number of the
 >> following errors in the compute logs:
 >>
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
 >> "/opt/stack/new/nova/nova/compute/manager.py", line 4230, in
 >> _unshelve_instance
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52] with rt.instance_claim(context,
 >> instance, limits):
 >
 >
 >
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52]
 >newcell.unpin_cpus(pinned_cpus)
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
 >> "/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52] pinned=list(self.pinned_cpus))
 >> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
 >> cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot
 >> pin/unpin cpus [6] from the following pinned set [0, 2, 4]
 >>
 >> on or around the time of the failures in Tempest.
 >>
 >> Perhaps tomorrow morning we can look into handling the above exception
 >> properly from the compute manager, since clearly we shouldn't be
 >> allowing CPUPinningInvalid to be raised in the resource tracker's
 >_update_usage() call
 >
 >First, it seems wrong to me that an _unshelve_instance() call would result in
 >unpinning any CPUs.  If the instance was using pinned CPUs then I would
 >expect the CPUs to be unpinned when doing the "shelve" operation.  When
 >we do an instance claim as part of the "unshelve" operation we should be
 >pinning CPUs, not unpinning them.
 >
 >Second, the reason why CPUPinningInvalid gets raised in _update_usage() is
 >that it has discovered an inconsistency in its view of resources.  In this 
 >case,
 >it's trying to unpin CPU 6 from a set of pinned cpus that doesn't include CPU
 >6.  I think this is a valid concern and should result in an error log.  
 >Whether it
 >should cause the unshelve operation to fail is a separate question, but it's
 >definitely a symptom that something is wrong with resource tracking on this
 >compute node.
 >
 >Chris
 >
 >




Re: [openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests

2016-05-25 Thread Chris Friesen

On 05/22/2016 05:41 PM, Jay Pipes wrote:

Hello Novaites,

I've noticed that the Intel NFV CI has been failing all test runs for quite some
time (at least a few days), always failing the same tests around shelve/unshelve
operations.





I looked through the conductor and compute logs to see if I could find any
possible reasons for the errors and found a number of the following errors in
the compute logs:

2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
"/opt/stack/new/nova/nova/compute/manager.py", line 4230, in _unshelve_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52] with rt.instance_claim(context,
instance, limits):





2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52] newcell.unpin_cpus(pinned_cpus)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File
"/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52] pinned=list(self.pinned_cpus))
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot pin/unpin cpus
[6] from the following pinned set [0, 2, 4]

on or around the time of the failures in Tempest.

Perhaps tomorrow morning we can look into handling the above exception properly
from the compute manager, since clearly we shouldn't be allowing
CPUPinningInvalid to be raised in the resource tracker's _update_usage() 
call


First, it seems wrong to me that an _unshelve_instance() call would result in 
unpinning any CPUs.  If the instance was using pinned CPUs then I would expect 
the CPUs to be unpinned when doing the "shelve" operation.  When we do an 
instance claim as part of the "unshelve" operation we should be pinning CPUs, 
not unpinning them.
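
To make that expected symmetry concrete, here is a minimal stand-alone sketch 
(FakeCell and its methods are illustrative stand-ins, not nova's actual objects):

class FakeCell(object):
    def __init__(self):
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        self.pinned_cpus |= set(cpus)

    def unpin_cpus(self, cpus):
        self.pinned_cpus -= set(cpus)


cell = FakeCell()
cell.pin_cpus([6])     # the original instance claim pins CPU 6
cell.unpin_cpus([6])   # shelve-offload should release it
cell.pin_cpus([6])     # the claim made during unshelve pins it again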


Second, the reason why CPUPinningInvalid gets raised in _update_usage() is that 
it has discovered an inconsistency in its view of resources.  In this case, it's 
trying to unpin CPU 6 from a set of pinned cpus that doesn't include CPU 6.  I 
think this is a valid concern and should result in an error log.  Whether it 
should cause the unshelve operation to fail is a separate question, but it's 
definitely a symptom that something is wrong with resource tracking on this 
compute node.
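
For reference, a simplified self-contained sketch of the check that appears to 
be firing; the class and method names come from the traceback, but the body 
below is an assumption, not nova's actual code:

class CPUPinningInvalid(Exception):
    pass


class NUMACell(object):
    def __init__(self, pinned_cpus):
        self.pinned_cpus = set(pinned_cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus <= self.pinned_cpus:
            raise CPUPinningInvalid(
                'Cannot pin/unpin cpus %s from the following pinned set %s'
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


# Reproduces the failure from the compute log: CPU 6 is not in the tracked
# pinned set [0, 2, 4], i.e. the tracker's view of the host is inconsistent
# with what it is being asked to release.
cell = NUMACell(pinned_cpus=[0, 2, 4])
try:
    cell.unpin_cpus([6])
except CPUPinningInvalid as exc:
    print(exc)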


Chris




[openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests

2016-05-22 Thread Jay Pipes

Hello Novaites,

I've noticed that the Intel NFV CI has been failing all test runs for 
quite some time (at least a few days), always failing the same tests 
around shelve/unshelve operations.


The shelve/unshelve Tempest tests always result in a timeout exception 
being raised, looking similar to the following, from [1]:


2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base Traceback 
(most recent call last):
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base   File 
"tempest/api/compute/base.py", line 166, in server_check_teardown
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base 
cls.server_id, 'ACTIVE')
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base   File 
"tempest/common/waiters.py", line 95, in wait_for_server_status
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base raise 
exceptions.TimeoutException(message)
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base 
TimeoutException: Request timed out
2016-05-22 22:25:30.697 13974 ERROR tempest.api.compute.base Details: 
(ServerActionsTestJSON:tearDown) Server 
cae6fd47-0968-4922-a03e-3f2872e4eb52 failed to reach ACTIVE status and 
task state "None" within the required time (196 s). Current status: 
SHELVED_OFFLOADED. Current task state: None.
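
For context, the waiter behaviour behind that message is just a poll-until-timeout 
loop; a simplified stand-alone sketch (not Tempest's actual waiters.py) looks 
roughly like this:

import time


class TimeoutException(Exception):
    pass


def wait_for_server_status(get_status, expected='ACTIVE', timeout=196,
                           interval=1):
    # Poll until the server reaches the expected status or the timeout
    # (in seconds) expires, then raise.
    start = time.time()
    current = get_status()
    while time.time() - start < timeout:
        current = get_status()
        if current == expected:
            return
        time.sleep(interval)
    raise TimeoutException('Server failed to reach %s status within the '
                           'required time (%s s). Current status: %s.'
                           % (expected, timeout, current))


# In the failing runs the server never leaves SHELVED_OFFLOADED, so a call
# like the following would eventually raise TimeoutException:
# wait_for_server_status(lambda: 'SHELVED_OFFLOADED')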


I looked through the conductor and compute logs to see if I could find 
any possible reasons for the errors and found a number of the following 
errors in the compute logs:


2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/compute/manager.py", line 4230, in 
_unshelve_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] with 
rt.instance_claim(context, instance, limits):
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", 
line 271, in inner
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] return f(*args, **kwargs)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/compute/resource_tracker.py", line 151, in 
instance_claim
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] 
self._update_usage_from_instance(context, instance_ref)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/compute/resource_tracker.py", line 827, in 
_update_usage_from_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] self._update_usage(instance, 
sign=sign)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/compute/resource_tracker.py", line 666, in 
_update_usage
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] self.compute_node, usage, free)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/virt/hardware.py", line 1482, in 
get_host_numa_usage_from_instance
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] host_numa_topology, 
instance_numa_topology, free=free))
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/virt/hardware.py", line 1348, in 
numa_usage_from_instances
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] newcell.unpin_cpus(pinned_cpus)
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52]   File 
"/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] pinned=list(self.pinned_cpus))
2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance: 
cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot 
pin/unpin cpus [6] from the following pinned set [0, 2, 4]


on or around the time of the failures in Tempest.

Perhaps tomorrow morning we can look into handling the above exception 
properly from the compute manager, since clearly we shouldn't be 
allowing CPUPinningInvalid to be raised in the resource tracker's 
_update_usage() call.
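
As a rough illustration of what handling it from the compute manager could look 
like, here is a stand-alone sketch; the real change would sit around the 
rt.instance_claim() call in _unshelve_instance(), and everything below is a 
stand-in rather than a proposed patch:

class CPUPinningInvalid(Exception):
    pass


def instance_claim():
    # Stand-in for rt.instance_claim(); simulates hitting the bad state.
    raise CPUPinningInvalid('Cannot pin/unpin cpus [6] from the following '
                            'pinned set [0, 2, 4]')


def unshelve_instance():
    try:
        instance_claim()
    except CPUPinningInvalid as exc:
        # Fail the unshelve cleanly and surface the inconsistency rather
        # than letting the exception escape partway through the operation.
        print('Unshelve aborted, inconsistent pinning state: %s' % exc)
        return False
    return True


unshelve_instance()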


Anyway, see you on IRC tomorrow morning and let's try to fix this.

Best,
-jay

[1] 
http://intel-openstack-ci-logs.