Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2015-01-14 Thread Matt Riedemann



On 1/14/2015 4:03 PM, Matt Riedemann wrote:



On 12/8/2014 3:12 PM, Jeremy Stanley wrote:

On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:

As Dan Berrangé noted, it's nearly impossible to reproduce this issue
independently outside of OpenStack Gating environment. I brought this up
at the recently concluded KVM Forum earlier this October. To debug this
any further, one of the QEMU block layer developers asked if we can get
QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
this too, previously) to get further tracing details.


We document thoroughly how to reproduce the environments we use for
testing OpenStack. There's nothing rarified about a Gate run that
anyone with access to a public cloud provider would be unable to
reproduce, save being able to run it over and over enough times to
expose less frequent failures.


FWIW, I myself couldn't reproduce it independently via libvirt
alone or via QMP (QEMU Machine Protocol) commands.

Dan's workaround (enable it permanently, except for under the
gate) sounds sensible to me.

[...]

I'm dubious of this as it basically says we know this breaks
sometimes, so we're going to stop testing that it works at all and
possibly let it get even more broken, but you should be safe to rely
on it anyway.

The QA team tries very hard to make our integration testing
environment as closely as possible mimic real-world deployment
configurations. If these sorts of bugs emerge more often because of,
for example, resource constraints in the test environment then it
should be entirely likely they'd also be seen in production with the
same frequency if run on similarly constrained equipment. And as
we've observed in the past, any code path we stop testing quickly
accumulates new bugs that go unnoticed until they impact someone's
production environment at 3am.



Bringing this back up since Jesse Keating in IRC was asking about this
again today. Sounds like we've heard from a few people that are running
this in labs without problems, maybe they are patching libvirt/qemu, I
don't know, but we have other things that we know have broken parts and
that's why they run on the experimental queue, e.g. cells, nova +
ceph/rbd. We also know we're a bit busted in the ec2 API right now with
the latest boto release (2.35.1), so we have a cap on that.

These issues are being worked, but regarding this particular way that
we've disabled the function (with a version cap in the code), someone
has to go in and patch that out, which kind of sucks if they could have
just used a config option to enable it at their own risk.

That's why I'm proposing something like an [experimental] group. We
could put this into the [workarounds] group but this isn't really a
workaround for anything so that doesn't really make sense to me.

I'd personally be OK with putting it into the [libvirt] group with a
warning in the config option help and code that this isn't currently
tested in the gate so we aren't sure it's going to work, which we've
done for cells and some of the virt drivers, e.g. libvirt on
non-x86_64/QEMU systems.



I'm going to play with this revert [1] on the Fedora 21 experimental 
queue which is running libvirt 1.2.9 and qemu 2.1.2.


Join me, won't you? :)

[1] https://review.openstack.org/#/c/147332/

--

Thanks,

Matt Riedemann


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2015-01-14 Thread Matt Riedemann



On 12/8/2014 3:12 PM, Jeremy Stanley wrote:

On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:

As Dan Berrangé noted, it's nearly impossible to reproduce this issue
independently outside of OpenStack Gating environment. I brought this up
at the recently concluded KVM Forum earlier this October. To debug this
any further, one of the QEMU block layer developers asked if we can get
QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
this too, previously) to get further tracing details.


We document thoroughly how to reproduce the environments we use for
testing OpenStack. There's nothing rarified about a Gate run that
anyone with access to a public cloud provider would be unable to
reproduce, save being able to run it over and over enough times to
expose less frequent failures.


FWIW, I myself couldn't reproduce it independently via libvirt
alone or via QMP (QEMU Machine Protocol) commands.

Dan's workaround (enable it permanently, except for under the
gate) sounds sensible to me.

[...]

I'm dubious of this as it basically says we know this breaks
sometimes, so we're going to stop testing that it works at all and
possibly let it get even more broken, but you should be safe to rely
on it anyway.

The QA team tries very hard to make our integration testing
environment as closely as possible mimic real-world deployment
configurations. If these sorts of bugs emerge more often because of,
for example, resource constraints in the test environment then it
should be entirely likely they'd also be seen in production with the
same frequency if run on similarly constrained equipment. And as
we've observed in the past, any code path we stop testing quickly
accumulates new bugs that go unnoticed until they impact someone's
production environment at 3am.



Bringing this back up since Jesse Keating in IRC was asking about this 
again today. Sounds like we've heard from a few people that are running 
this in labs without problems, maybe they are patching libvirt/qemu, I 
don't know, but we have other things that we know have broken parts and 
that's why they run on the experimental queue, e.g. cells, nova + 
ceph/rbd. We also know we're a bit busted in the ec2 API right now with 
the latest boto release (2.35.1), so we have a cap on that.


These issues are being worked, but regarding this particular way that 
we've disabled the function (with a version cap in the code), someone 
has to go in and patch that out, which kind of sucks if they could have 
just used a config option to enable it at their own risk.


That's why I'm proposing something like an [experimental] group. We 
could put this into the [workarounds] group but this isn't really a 
workaround for anything so that doesn't really make sense to me.


I'd personally be OK with putting it into the [libvirt] group with a 
warning in the config option help and code that this isn't currently 
tested in the gate so we aren't sure it's going to work, which we've 
done for cells and some of the virt drivers, e.g. libvirt on 
non-x86_64/QEMU systems.
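
For illustration, the trade-off described above (a hard-coded version cap versus an opt-in option that logs a warning) could be sketched roughly as below. This is only a sketch under stated assumptions, not Nova's actual code: the version tuple, the function name `live_snapshot_allowed` and the `opt_in` flag (standing in for something like the `CONF.libvirt.live_snapshot_supported` idea from earlier in the thread) are all hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical value for illustration only. An artificially high minimum
# version effectively disables the feature for every real deployment.
ARTIFICIAL_MIN_LIBVIRT_VERSION = (99, 0, 0)


def live_snapshot_allowed(libvirt_version, opt_in=False):
    """Return True if the live snapshot code path should be used.

    libvirt_version: (major, minor, micro) tuple of the running libvirt.
    opt_in: hypothetical operator-set flag, disabled by default.
    """
    if opt_in:
        # The operator explicitly asked for it, so warn and proceed.
        LOG.warning("Live snapshot is enabled by configuration but is not "
                    "currently tested in the gate; use at your own risk.")
        return True
    # Without the flag only the version cap decides, so operators would
    # have to patch the constant to turn the feature on.
    return tuple(libvirt_version) >= ARTIFICIAL_MIN_LIBVIRT_VERSION
```

With the cap alone, `live_snapshot_allowed((1, 2, 9))` is false and can only be changed by editing the code; with the flag, an operator can opt in without patching.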


--

Thanks,

Matt Riedemann




Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-08 Thread Daniel P. Berrange
On Fri, Dec 05, 2014 at 12:50:37PM -0600, Matt Riedemann wrote:
> In Juno we effectively disabled live snapshots with libvirt due to bug
> 1334398 [1] failing the gate about 25% of the time.
>
> I was going through the Juno release notes today and saw this as a known
> issue, which reminded me of it and was wondering if there is anything being
> done about it?
>
> As I recall, it *works* but it wasn't working under the stress our
> check/gate system puts on that code path.

Yep, I've tried to reproduce the problem in countless different ways and
never succeeded, even when replicating the gate test VM config setup
exactly. IOW it is a highly load-dependent edge case.

IMHO we did a disservice to users by disabling this. Based on my experience
trying to reproduce it, it is something that would work fine for end users
times out of 1. I think we should just put a temporary hack into Nova
that only disables the code when running under the gate systems, leaving it
enabled for users.

> One thing I'm thinking is, couldn't we make this an experimental config
> option and by default it's disabled but we could run it in the experimental
> queue, or people could use it without having to patch the code to remove the
> artificial minimum version constraint put in the code.
>
> Something like:
>
> if CONF.libvirt.live_snapshot_supported:
>     # do your thing

I don't really think we need that. Just enable it permanently, except for
under the gate.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-08 Thread Jeremy Stanley
On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> As Dan Berrangé noted, it's nearly impossible to reproduce this issue
> independently outside of OpenStack Gating environment. I brought this up
> at the recently concluded KVM Forum earlier this October. To debug this
> any further, one of the QEMU block layer developers asked if we can get
> QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
> this too, previously) to get further tracing details.

We document thoroughly how to reproduce the environments we use for
testing OpenStack. There's nothing rarified about a Gate run that
anyone with access to a public cloud provider would be unable to
reproduce, save being able to run it over and over enough times to
expose less frequent failures.

> FWIW, I myself couldn't reproduce it independently via libvirt
> alone or via QMP (QEMU Machine Protocol) commands.
>
> Dan's workaround (enable it permanently, except for under the
> gate) sounds sensible to me.
[...]

I'm dubious of this as it basically says we know this breaks
sometimes, so we're going to stop testing that it works at all and
possibly let it get even more broken, but you should be safe to rely
on it anyway.

The QA team tries very hard to make our integration testing
environment as closely as possible mimic real-world deployment
configurations. If these sorts of bugs emerge more often because of,
for example, resource constraints in the test environment then it
should be entirely likely they'd also be seen in production with the
same frequency if run on similarly constrained equipment. And as
we've observed in the past, any code path we stop testing quickly
accumulates new bugs that go unnoticed until they impact someone's
production environment at 3am.
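
As a rough aid to the "run it over and over enough times" point, the arithmetic can be sketched as follows. It assumes every run fails independently with the same probability, which is a simplification; the 25% figure is the gate failure rate quoted at the start of this thread, and the failure rate outside the gate is not known.

```python
import math


def chance_of_seeing_failure(per_run_failure_rate, runs):
    """Probability of at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


def runs_needed(per_run_failure_rate, confidence=0.95):
    """Smallest number of independent runs that shows at least one
    failure with probability `confidence`."""
    if not 0.0 < per_run_failure_rate < 1.0:
        raise ValueError("per_run_failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Solve 1 - (1 - p)^n >= confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - per_run_failure_rate))


# At the gate's quoted 25% failure rate, 11 runs give a >95% chance
# of hitting the bug at least once.
print(runs_needed(0.25))
```

The required number of runs grows quickly as the per-run failure rate drops, which is one way to read the difficulty people report reproducing the bug outside the gate.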
-- 
Jeremy Stanley



Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-08 Thread melanie witt
On Dec 8, 2014, at 13:12, Jeremy Stanley fu...@yuggoth.org wrote:

> I'm dubious of this as it basically says we know this breaks
> sometimes, so we're going to stop testing that it works at all and
> possibly let it get even more broken, but you should be safe to rely
> on it anyway.

+1, it seems bad to enable something everywhere *except* the gate.

I prefer the original suggestion to include a config option that is by default 
disabled that a user can enable if they want.

From what I understand, the feature works most of the time and I don't see 
why a user is guaranteed not to encounter the same conditions that happen in 
the gate. For that reason I think it makes sense to be an experimental, opt-in 
by config, feature.
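
A minimal sketch of what "opt-in by config, disabled by default" could look like is below. It is an assumption-laden illustration: Nova reads its options through oslo.config, whereas this sketch swaps in Python's standard `configparser` to stay self-contained, and both the `[experimental]` group and the `live_snapshot_supported` option are proposals from this thread rather than existing settings.

```python
import configparser

# Hypothetical nova.conf fragment; the group and option names are
# proposals from this thread, not existing settings.
SAMPLE_CONF = """
[experimental]
live_snapshot_supported = true
"""


def live_snapshot_enabled(conf_text):
    """Return the opt-in flag, defaulting to disabled when it is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=False covers both a missing group and a missing option.
    return parser.getboolean("experimental", "live_snapshot_supported",
                             fallback=False)


print(live_snapshot_enabled(SAMPLE_CONF))  # operator opted in
print(live_snapshot_enabled(""))           # nothing configured: stays off
```

The point of the default is that an untouched configuration keeps the feature off, while a user who accepts the risk can enable it without patching code.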

melanie (melwitt)








Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-08 Thread Kashyap Chamarthy
On Mon, Dec 08, 2014 at 09:12:24PM +, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> > As Dan Berrangé noted, it's nearly impossible to reproduce this issue
> > independently outside of OpenStack Gating environment. I brought this up
> > at the recently concluded KVM Forum earlier this October. To debug this
> > any further, one of the QEMU block layer developers asked if we can get
> > QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
> > this too, previously) to get further tracing details.
>
> We document thoroughly how to reproduce the environments we use for
> testing OpenStack.

Yep, documentation is appreciated.

> There's nothing rarified about a Gate run that anyone with access to
> a public cloud provider would be unable to reproduce, save being able
> to run it over and over enough times to expose less frequent failures.

Sure. To be fair, this was actually tried. At the risk of over
discussing the topic, allow me to provide a bit more context, quoting
Dan's email from an old thread[1] (Thoughts on the patch test failure
rate and moving forward Jul 23, 2014) here for convenience:

In some of the harder gate bugs I've looked at (especially the
infamous 'live snapshot' timeout bug), it has been damn hard to
actually figure out what's wrong. AFAIK, no one has ever been able
to reproduce it outside of the gate infrastructure. I've even gone
as far as setting up identical Ubuntu VMs to the ones used in the
gate on a local cloud, and running the tempest tests multiple times,
but still can't reproduce what happens on the gate machines
themselves :-( As such we're relying on code inspection and the
collected log messages to try and figure out what might be wrong.

The gate collects a lot of info and publishes it, but in this case I
have found the published logs to be insufficient - I needed to get
the more verbose libvirtd.log file. devstack has the ability to turn
this on via an environment variable, but it is disabled by default
because it would add 3% to the total size of logs collected per gate
job.

There's no way for me to get that environment variable for devstack
turned on for a specific review I want to test with. In the end I
uploaded a change to nova which abused rootwrap to elevate
privileges, install extra deb packages, reconfigure libvirtd logging
and restart the libvirtd daemon.

   
   https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py

My next attack is to build a custom QEMU binary and hack nova
further so that it can download my custom QEMU binary from a website
onto the gate machine and run the test with it. Failing that I'm
going to be hacking things to try to attach to QEMU in the gate with
GDB and get stack traces.  Anything is doable thanks to rootwrap
giving us a way to elevate privileges from Nova, but it is a
somewhat tedious approach.


   [1] http://lists.openstack.org/pipermail/openstack-dev/2014-July/041148.html

To add to the above: according to the bug, in one of the many
invocations the issue _was_ reproduced once, albeit with
questionable likelihood (details in the bug).

So, it's not that what you're suggesting was never tried. But, from the
above, you can clearly see what kind of convoluted methods you need to
resort to.

One concrete point from the above: it'd be very useful to have an env
variable that can be toggled to enable libvirt/QEMU run under `gdb` for
$REVIEW.

(Sure, it's a patch that needs to be worked on. . .)
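
To make the idea concrete, here is a minimal sketch of such a toggle. Everything specific in it is an assumption: the environment variable name `EXAMPLE_RUN_QEMU_UNDER_GDB` is made up for illustration, no such switch exists in devstack or Nova as far as this thread says, and in practice libvirt launches QEMU itself, so a real patch would have to hook in elsewhere. Only the gdb options used (`-batch`, `-ex`, `--args`) are standard.

```python
import os

# Hypothetical variable name, used for illustration only.
GDB_TOGGLE_ENV = "EXAMPLE_RUN_QEMU_UNDER_GDB"


def build_command(qemu_argv, environ=None):
    """Return the command line to execute, wrapped in gdb if toggled on."""
    environ = os.environ if environ is None else environ
    if environ.get(GDB_TOGGLE_ENV, "").lower() not in ("1", "true", "yes"):
        # Toggle is off: run QEMU exactly as before.
        return list(qemu_argv)
    # -batch exits gdb when the commands finish, each -ex runs one gdb
    # command, and --args treats everything after it as the program and
    # its arguments.
    return ["gdb", "-batch",
            "-ex", "run",
            "-ex", "thread apply all bt",
            "--args"] + list(qemu_argv)


print(build_command(["qemu-system-x86_64", "-m", "512"],
                    environ={GDB_TOGGLE_ENV: "1"}))
```

Note that the backtrace step only runs once the program stops (for example on a crash or signal); for a hang such as a timeout, something would still need to interrupt QEMU first.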

[. . .]

> The QA team tries very hard to make our integration testing
> environment as closely as possible mimic real-world deployment
> configurations. If these sorts of bugs emerge more often because of,
> for example, resource constraints in the test environment then it
> should be entirely likely they'd also be seen in production with the
> same frequency if run on similarly constrained equipment. And as we've
> observed in the past, any code path we stop testing quickly
> accumulates new bugs that go unnoticed until they impact someone's
> production environment at 3am.

I realize you're raising the point that it should not be taken lightly
-- hope the context provided in this email demonstrates that it's not
the case.


PS: FWIW, I do enable this codepath in my test environments (sure, it's
not *representative*), but I'm yet to reproduce the bug.


-- 
/kashyap



Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-05 Thread Sean Dague
On 12/05/2014 01:50 PM, Matt Riedemann wrote:
> In Juno we effectively disabled live snapshots with libvirt due to bug
> 1334398 [1] failing the gate about 25% of the time.
>
> I was going through the Juno release notes today and saw this as a known
> issue, which reminded me of it and was wondering if there is anything
> being done about it?
>
> As I recall, it *works* but it wasn't working under the stress our
> check/gate system puts on that code path.
>
> One thing I'm thinking is, couldn't we make this an experimental config
> option and by default it's disabled but we could run it in the
> experimental queue, or people could use it without having to patch the
> code to remove the artificial minimum version constraint put in the code.
>
> Something like:
>
> if CONF.libvirt.live_snapshot_supported:
>     # do your thing
>
> [1] https://bugs.launchpad.net/nova/+bug/1334398

So, it works. If you aren't booting / shutting down guests at exactly
the same time as snapshotting. I believe cburgess said in IRC yesterday
he was going to take another look at it next week.

I'm happy to put this into dansmith's patented [workarounds] config
group (coming soon to fix the qemu-convert bug). But I don't think this
should be a normal libvirt option.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

2014-12-05 Thread Matt Riedemann



On 12/5/2014 1:32 PM, Sean Dague wrote:

On 12/05/2014 01:50 PM, Matt Riedemann wrote:

In Juno we effectively disabled live snapshots with libvirt due to bug
1334398 [1] failing the gate about 25% of the time.

I was going through the Juno release notes today and saw this as a known
issue, which reminded me of it and was wondering if there is anything
being done about it?

As I recall, it *works* but it wasn't working under the stress our
check/gate system puts on that code path.

One thing I'm thinking is, couldn't we make this an experimental config
option and by default it's disabled but we could run it in the
experimental queue, or people could use it without having to patch the
code to remove the artificial minimum version constraint put in the code.

Something like:

if CONF.libvirt.live_snapshot_supported:
    # do your thing

[1] https://bugs.launchpad.net/nova/+bug/1334398


So, it works. If you aren't booting / shutting down guests at exactly
the same time as snapshotting. I believe cburgess said in IRC yesterday
he was going to take another look at it next week.

I'm happy to put this into dansmith's patented [workarounds] config
group (coming soon to fix the qemu-convert bug). But I don't think this
should be a normal libvirt option.

-Sean



Yeah the [workarounds] group is what got me thinking about it too as a 
config option, otherwise I think the idea of an [experimental] config 
group has come up before as a place to put 'not tested, here be dragons' 
type stuff.


--

Thanks,

Matt Riedemann

