Re: [openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
On 09/15/2015 02:25 AM, Daniel P. Berrange wrote:
> Taking a host offline for maintenance should be considered equivalent
> to throwing away the existing host and deploying a new host. There
> should be zero state carry-over from the OpenStack POV, since both
> software and hardware changes can potentially invalidate previous
> information used by the scheduler for deploying on that host. The idea
> of recovering a previously running guest should be explicitly
> unsupported.

This isn't the way the nova code is currently written, though. By default, any instances that were running on that compute node will still be recorded in the DB as running on that compute node, but in the "stopped" state. If you then do a "nova start", they'll try to start up on that node again. Heck, if you enable "resume_guests_state_on_host_boot", nova will restart them automatically for you on startup.

To robustly do what you're talking about would require someone (nova, the operator, etc.) to migrate all instances off of a compute node before taking it down (which is currently impossible for suspended instances), and then force a "nova evacuate" (or maybe "nova delete") for every instance that was on a compute node that went down.

Chris
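For concreteness, a rough sketch of that evacuate step using python-novaclient. The credentials, auth URL, host name, and the shared-storage assumption are all placeholders, so treat this as illustrative rather than a drop-in tool:

    # Evacuate everything the DB still records against a downed compute
    # node. Assumes python-novaclient's v2 API and admin credentials;
    # all connection details below are hypothetical.
    from novaclient import client

    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://keystone.example.com:5000/v2.0')

    # Find every instance still recorded on the downed node.
    servers = nova.servers.list(search_opts={'host': 'compute-1',
                                             'all_tenants': 1})
    for server in servers:
        # Rebuild the instance on another host. Without shared storage
        # the disk state is lost, which is exactly the "treat it as a
        # new host" behaviour discussed in this thread.
        nova.servers.evacuate(server, on_shared_storage=True)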
Re: [openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
On Mon, Sep 14, 2015 at 09:34:31PM -0400, Jay Pipes wrote:
> On 09/10/2015 05:23 PM, Brent Eagles wrote:
> > Hi,
> >
> > I was recently informed of a situation that came up when an engineer
> > added an SR-IOV nic to a compute node that was hosting some guests
> > that had VFs attached. Unfortunately, adding the card shuffled the
> > PCI addresses, causing some degree of havoc. Basically, the PCI
> > addresses associated with the previously allocated VFs were no
> > longer valid.
> >
> > I tend to consider this a non-issue. The expectation that hosts have
> > a relatively static hardware configuration (and kernel/driver
> > configs, for that matter) is the price you pay for having pets with
> > direct hardware access. That being said, this did come as a surprise
> > to some of those involved, and I don't think we have any messaging
> > around this or advice on how to deal with situations like this.
> >
> > So what should we do? I can't quite see altering OpenStack to deal
> > with this situation (or even how that could work). Has anyone done
> > any research into this problem, even if it is how to recover or
> > extricate a guest that is no longer valid? It seems that at the very
> > least we could use some stern warnings in the docs.
>
> Hi Brent,
>
> Interesting issue. We have code in the PCI tracker that ostensibly
> handles this problem:
>
> https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L145-L164
>
> But the note from yjiang5 is telling:
>
>     # Pci properties may change while assigned because of
>     # hotplug or config changes. Although normally this should
>     # not happen.
>     # As the devices have been assigned to a instance, we defer
>     # the change till the instance is destroyed. We will
>     # not sync the new properties with database before that.
>     # TODO(yjiang5): Not sure if this is a right policy, but
>     # at least it avoids some confusion and, if
>     # we can add more action like killing the instance
>     # by force in future.
>
> Basically, if the PCI device tracker notices that an instance is
> assigned a PCI device with an address that no longer exists in the PCI
> device addresses returned from libvirt, it will (eventually, in the
> _free_instance() method) remove the PCI device assignment from the
> Instance object, but it will make no attempt to assign a new PCI
> device that meets the original PCI device specification in the launch
> request.
>
> Should we handle this case and attempt a "hot re-assignment of a PCI
> device"? Perhaps. Is it high priority? Not really, IMHO.

Hotplugging new PCI devices to a running host should not have any impact on existing PCI device addresses - it'll merely add new addresses for new devices; existing devices are unchanged, so everything should "just work" in that case. IIUC, Brent's question was about turning off the host and cold-plugging/unplugging hardware, which /is/ liable to arbitrarily re-arrange existing PCI device addresses.

> If you'd like to file a bug against Nova, that would be cool, though.

I think it is explicitly out of scope for Nova to deal with this scenario.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
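To make the failure mode concrete, here is a minimal, self-contained sketch of the address drift the tracker has to cope with. This is not nova code, and the addresses are made up for the example:

    # How a cold hardware change leaves instances holding PCI addresses
    # the host no longer reports. The address lists are hypothetical.

    def find_stale_assignments(db_assigned, host_reported):
        """Addresses recorded against instances that the host no longer has."""
        return sorted(set(db_assigned) - set(host_reported))

    # VF addresses assigned before the extra card was cold-plugged:
    db_assigned = ['0000:05:10.1', '0000:05:10.3']
    # What the host enumerates after the change -- everything shifted:
    host_reported = ['0000:06:10.1', '0000:06:10.3', '0000:06:10.5']

    print(find_stale_assignments(db_assigned, host_reported))
    # ['0000:05:10.1', '0000:05:10.3'] -- dangling assignments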
Re: [openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
On Thu, Sep 10, 2015 at 06:53:06PM -0230, Brent Eagles wrote:
> Hi,
>
> I was recently informed of a situation that came up when an engineer
> added an SR-IOV nic to a compute node that was hosting some guests that
> had VFs attached. Unfortunately, adding the card shuffled the PCI
> addresses, causing some degree of havoc. Basically, the PCI addresses
> associated with the previously allocated VFs were no longer valid.

This seems to be implying that they took the host offline to make hardware changes, and then tried to re-start the originally running guests directly, without letting the scheduler re-run. If correct, then IMHO that is an unsupported approach. After making any hardware changes you should essentially consider it to be a new compute host. There is no expectation that previously running guests on that host can be restarted. You must let the compute host report its new hardware capabilities, and let the scheduler place guests on it from scratch, using the new PCI address info.

> I tend to consider this a non-issue. The expectation that hosts have a
> relatively static hardware configuration (and kernel/driver configs,
> for that matter) is the price you pay for having pets with direct
> hardware access. That being said, this did come as a surprise to some
> of those involved, and I don't think we have any messaging around this
> or advice on how to deal with situations like this.
>
> So what should we do? I can't quite see altering OpenStack to deal with
> this situation (or even how that could work). Has anyone done any
> research into this problem, even if it is how to recover or extricate
> a guest that is no longer valid? It seems that at the very least we
> could use some stern warnings in the docs.

Taking a host offline for maintenance should be considered equivalent to throwing away the existing host and deploying a new host. There should be zero state carry-over from the OpenStack POV, since both software and hardware changes can potentially invalidate previous information used by the scheduler for deploying on that host. The idea of recovering a previously running guest should be explicitly unsupported.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
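"Report its new hardware capabilities" boils down to re-enumerating the host's PCI devices. A minimal sketch of that enumeration via libvirt's node-device API, assuming the libvirt-python bindings and local qemu:///system access:

    # List the host's PCI devices -- the raw data nova's resource tracker
    # rebuilds its PCI pool from on startup. Assumes libvirt-python and a
    # local qemu:///system connection.
    import libvirt

    conn = libvirt.open('qemu:///system')
    flags = libvirt.VIR_CONNECT_LIST_NODE_DEVICES_CAP_PCI_DEV
    for dev in conn.listAllDevices(flags):
        # Device names encode the current address, e.g. pci_0000_06_10_1.
        # After cold-plugging a card these can all shift, which is why a
        # fresh scheduling pass against fresh data is required.
        print(dev.name())
    conn.close()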
Re: [openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
On 09/10/2015 05:23 PM, Brent Eagles wrote:
> Hi,
>
> I was recently informed of a situation that came up when an engineer
> added an SR-IOV nic to a compute node that was hosting some guests that
> had VFs attached. Unfortunately, adding the card shuffled the PCI
> addresses, causing some degree of havoc. Basically, the PCI addresses
> associated with the previously allocated VFs were no longer valid.
>
> I tend to consider this a non-issue. The expectation that hosts have a
> relatively static hardware configuration (and kernel/driver configs,
> for that matter) is the price you pay for having pets with direct
> hardware access. That being said, this did come as a surprise to some
> of those involved, and I don't think we have any messaging around this
> or advice on how to deal with situations like this.
>
> So what should we do? I can't quite see altering OpenStack to deal with
> this situation (or even how that could work). Has anyone done any
> research into this problem, even if it is how to recover or extricate
> a guest that is no longer valid? It seems that at the very least we
> could use some stern warnings in the docs.

Hi Brent,

Interesting issue. We have code in the PCI tracker that ostensibly handles this problem:

https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L145-L164

But the note from yjiang5 is telling:

    # Pci properties may change while assigned because of
    # hotplug or config changes. Although normally this should
    # not happen.
    # As the devices have been assigned to a instance, we defer
    # the change till the instance is destroyed. We will
    # not sync the new properties with database before that.
    # TODO(yjiang5): Not sure if this is a right policy, but
    # at least it avoids some confusion and, if
    # we can add more action like killing the instance
    # by force in future.

Basically, if the PCI device tracker notices that an instance is assigned a PCI device with an address that no longer exists in the PCI device addresses returned from libvirt, it will (eventually, in the _free_instance() method) remove the PCI device assignment from the Instance object, but it will make no attempt to assign a new PCI device that meets the original PCI device specification in the launch request.

Should we handle this case and attempt a "hot re-assignment of a PCI device"? Perhaps. Is it high priority? Not really, IMHO.

If you'd like to file a bug against Nova, that would be cool, though.

Best,
-jay
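As a thought experiment, that "hot re-assignment" might look roughly like the following -- emphatically not current nova behaviour, with made-up device records, matching only on vendor/product id:

    # Hypothetical sketch of "hot re-assignment": when an instance's
    # device address has vanished, claim a free host device matching the
    # original spec. Not how nova works today; the records are made up.

    def reassign_vanished(vanished_devs, free_pool):
        """Map each vanished device address to a free device of the same type."""
        remaining = list(free_pool)
        mapping = {}
        for dev in vanished_devs:
            match = next((f for f in remaining
                          if (f['vendor_id'], f['product_id']) ==
                             (dev['vendor_id'], dev['product_id'])), None)
            if match is not None:
                remaining.remove(match)
                mapping[dev['address']] = match['address']
        return mapping

    vanished = [{'address': '0000:05:10.1',
                 'vendor_id': '8086', 'product_id': '10ed'}]
    free = [{'address': '0000:06:10.1',
             'vendor_id': '8086', 'product_id': '10ed'}]
    print(reassign_vanished(vanished, free))
    # {'0000:05:10.1': '0000:06:10.1'}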
Re: [openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
Brent is our Neutron-Nova liaison - can someone from the SR-IOV team please respond?

--
Sean M. Collins
[openstack-dev] [nova][neutron][SR-IOV] Hardware changes and shifting PCI addresses
Hi,

I was recently informed of a situation that came up when an engineer added an SR-IOV nic to a compute node that was hosting some guests that had VFs attached. Unfortunately, adding the card shuffled the PCI addresses, causing some degree of havoc. Basically, the PCI addresses associated with the previously allocated VFs were no longer valid.

I tend to consider this a non-issue. The expectation that hosts have a relatively static hardware configuration (and kernel/driver configs, for that matter) is the price you pay for having pets with direct hardware access. That being said, this did come as a surprise to some of those involved, and I don't think we have any messaging around this or advice on how to deal with situations like this.

So what should we do? I can't quite see altering OpenStack to deal with this situation (or even how that could work). Has anyone done any research into this problem, even if it is how to recover or extricate a guest that is no longer valid? It seems that at the very least we could use some stern warnings in the docs.

Cheers,
Brent
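For readers unfamiliar with why the shift is fatal: the guest's libvirt domain definition pins the VF by its exact bus/slot/function. A small illustration using a typical <hostdev> element (the address values are hypothetical):

    # Why shifted addresses break guests: the domain XML pins the VF by
    # exact PCI address. The snippet is a typical <hostdev> element with
    # hypothetical address values.
    import xml.etree.ElementTree as ET

    hostdev = ET.fromstring("""
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x10' function='0x1'/>
      </source>
    </hostdev>
    """)

    addr = hostdev.find('./source/address')
    print('guest expects its VF at {}:{}:{}.{}'.format(
        addr.get('domain')[2:], addr.get('bus')[2:],
        addr.get('slot')[2:], addr.get('function')[2:]))
    # If the card now enumerates at bus 0x06, this definition dangles and
    # the guest cannot start until the address is corrected.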