Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On 1/14/2015 4:03 PM, Matt Riedemann wrote:
> On 12/8/2014 3:12 PM, Jeremy Stanley wrote:
>> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
>>> As Dan Berrangé noted, it's nearly impossible to reproduce this issue independently outside of the OpenStack gating environment. I brought this up at the recently concluded KVM Forum earlier this October. To debug this any further, one of the QEMU block layer developers asked if we can get the QEMU instance on a Gate run running under `gdb` (IIRC, danpb suggested this too, previously) to get further tracing details.
>>
>> We document thoroughly how to reproduce the environments we use for testing OpenStack. There's nothing rarified about "a Gate run" that anyone with access to a public cloud provider would be unable to reproduce, save being able to run it over and over enough times to expose less frequent failures.
>>
>>> FWIW, I myself couldn't reproduce it independently via libvirt alone or via QMP (QEMU Machine Protocol) commands.
>>>
>>> Dan's workaround ("enable it permanently, except for under the gate") sounds sensible to me.
>> [...]
>>
>> I'm dubious of this as it basically says "we know this breaks sometimes, so we're going to stop testing that it works at all and possibly let it get even more broken, but you should be safe to rely on it anyway."
>>
>> The QA team tries very hard to make our integration testing environment mimic real-world deployment configurations as closely as possible. If these sorts of bugs emerge more often because of, for example, resource constraints in the test environment, then it should be entirely likely they'd also be seen in production with the same frequency if run on similarly constrained equipment. And as we've observed in the past, any code path we stop testing quickly accumulates new bugs that go unnoticed until they impact someone's production environment at 3am.
>
> Bringing this back up since Jesse Keating in IRC was asking about this again today.
>
> Sounds like we've heard from a few people that are running this in labs without problems; maybe they are patching libvirt/qemu, I don't know. But we have other things that we know have broken parts, and that's why they run on the experimental queue, e.g. cells, nova + ceph/rbd. We also know we're a bit busted in the ec2 API right now with the latest boto release (2.35.1), so we have a cap on that.
>
> These issues are being worked, but regarding this particular way that we've disabled the function (with a version cap in the code), someone has to go in and patch that out, which kind of sucks if they could have just used a config option to enable it at their own risk.
>
> That's why I'm proposing something like an [experimental] group. We could put this into the [workarounds] group, but this isn't really a workaround for anything, so that doesn't really make sense to me. I'd personally be OK with putting it into the [libvirt] group with a warning in the config option help and code that this isn't currently tested in the gate so we aren't sure it's going to work, which we've done for cells and some of the virt drivers, e.g. libvirt on non-x86_64/QEMU systems.

I'm going to play with this revert [1] on the Fedora 21 experimental queue, which is running libvirt 1.2.9 and qemu 2.1.2. Join me, won't you? :)

[1] https://review.openstack.org/#/c/147332/

--
Thanks,
Matt Riedemann

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
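The trade-off Matt describes (a hard-coded version cap that has to be patched out, versus an opt-in config option) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's `configparser` as a stand-in for nova's real config machinery, the cap value `(99, 0, 0)` is made up, and `live_snapshot_supported` is simply the option name floated earlier in this thread, not a confirmed nova option.

```python
import configparser

# Hypothetical cap, set above any released libvirt so that the feature is
# effectively off unless the operator opts in. Not nova's real constant.
MIN_LIVE_SNAPSHOT_VERSION = (99, 0, 0)


def parse_version(text):
    """Turn a dotted version string such as '1.2.9' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def load_conf(text):
    """Parse an INI-style config string (stand-in for reading nova.conf)."""
    conf = configparser.ConfigParser()
    conf.read_string(text)
    return conf


def live_snapshot_allowed(conf, libvirt_version):
    """Return True if a live snapshot should be attempted.

    The opt-in flag (hypothetical name, default False) bypasses the cap;
    without it, the version cap alone decides.
    """
    opted_in = conf.getboolean(
        "libvirt", "live_snapshot_supported", fallback=False
    )
    if opted_in:
        return True
    return parse_version(libvirt_version) >= MIN_LIVE_SNAPSHOT_VERSION
```

With a sketch like this, an operator on libvirt 1.2.9 would get the old behaviour by default and could enable the code path at their own risk by setting the flag under `[libvirt]`, without touching the source.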
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On 12/8/2014 3:12 PM, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
>> As Dan Berrangé noted, it's nearly impossible to reproduce this issue independently outside of the OpenStack gating environment. I brought this up at the recently concluded KVM Forum earlier this October. To debug this any further, one of the QEMU block layer developers asked if we can get the QEMU instance on a Gate run running under `gdb` (IIRC, danpb suggested this too, previously) to get further tracing details.
>
> We document thoroughly how to reproduce the environments we use for testing OpenStack. There's nothing rarified about "a Gate run" that anyone with access to a public cloud provider would be unable to reproduce, save being able to run it over and over enough times to expose less frequent failures.
>
>> FWIW, I myself couldn't reproduce it independently via libvirt alone or via QMP (QEMU Machine Protocol) commands.
>>
>> Dan's workaround ("enable it permanently, except for under the gate") sounds sensible to me.
> [...]
>
> I'm dubious of this as it basically says "we know this breaks sometimes, so we're going to stop testing that it works at all and possibly let it get even more broken, but you should be safe to rely on it anyway."
>
> The QA team tries very hard to make our integration testing environment mimic real-world deployment configurations as closely as possible. If these sorts of bugs emerge more often because of, for example, resource constraints in the test environment, then it should be entirely likely they'd also be seen in production with the same frequency if run on similarly constrained equipment. And as we've observed in the past, any code path we stop testing quickly accumulates new bugs that go unnoticed until they impact someone's production environment at 3am.

Bringing this back up since Jesse Keating in IRC was asking about this again today.

Sounds like we've heard from a few people that are running this in labs without problems; maybe they are patching libvirt/qemu, I don't know. But we have other things that we know have broken parts, and that's why they run on the experimental queue, e.g. cells, nova + ceph/rbd. We also know we're a bit busted in the ec2 API right now with the latest boto release (2.35.1), so we have a cap on that.

These issues are being worked, but regarding this particular way that we've disabled the function (with a version cap in the code), someone has to go in and patch that out, which kind of sucks if they could have just used a config option to enable it at their own risk.

That's why I'm proposing something like an [experimental] group. We could put this into the [workarounds] group, but this isn't really a workaround for anything, so that doesn't really make sense to me. I'd personally be OK with putting it into the [libvirt] group with a warning in the config option help and code that this isn't currently tested in the gate so we aren't sure it's going to work, which we've done for cells and some of the virt drivers, e.g. libvirt on non-x86_64/QEMU systems.

--
Thanks,
Matt Riedemann
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On Mon, Dec 08, 2014 at 09:12:24PM +0000, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> > As Dan Berrangé noted, it's nearly impossible to reproduce this issue independently outside of the OpenStack gating environment. I brought this up at the recently concluded KVM Forum earlier this October. To debug this any further, one of the QEMU block layer developers asked if we can get the QEMU instance on a Gate run running under `gdb` (IIRC, danpb suggested this too, previously) to get further tracing details.
>
> We document thoroughly how to reproduce the environments we use for testing OpenStack.

Yep, documentation is appreciated.

> There's nothing rarified about "a Gate run" that anyone with access to a public cloud provider would be unable to reproduce, save being able to run it over and over enough times to expose less frequent failures.

Sure. To be fair, this was actually tried. At the risk of over-discussing the topic, allow me to provide a bit more context, quoting Dan's email from an old thread [1] ("Thoughts on the patch test failure rate and moving forward", Jul 23, 2014) here for convenience:

    "In some of the harder gate bugs I've looked at (especially the infamous 'live snapshot' timeout bug), it has been damn hard to actually figure out what's wrong. AFAIK, no one has ever been able to reproduce it outside of the gate infrastructure. I've even gone as far as setting up identical Ubuntu VMs to the ones used in the gate on a local cloud, and running the tempest tests multiple times, but still can't reproduce what happens on the gate machines themselves :-( As such we're relying on code inspection and the collected log messages to try and figure out what might be wrong.

    The gate collects a lot of info and publishes it, but in this case I have found the published logs to be insufficient - I needed to get the more verbose libvirtd.log file. devstack has the ability to turn this on via an environment variable, but it is disabled by default because it would add 3% to the total size of logs collected per gate job.

    There's no way for me to get that environment variable for devstack turned on for a specific review I want to test with. In the end I uploaded a change to nova which abused rootwrap to elevate privileges, install extra deb packages, reconfigure libvirtd logging and restart the libvirtd daemon.

    https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
    https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py

    My next attack is to build a custom QEMU binary and hack nova further so that it can download my custom QEMU binary from a website onto the gate machine and run the test with it. Failing that I'm going to be hacking things to try to attach to QEMU in the gate with GDB and get stack traces. Anything is doable thanks to rootwrap giving us a way to elevate privileges from Nova, but it is a somewhat tedious approach."

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-July/041148.html

To add to the above: as you can find in the bug, across plenty of invocations the above issue _was_ reproduced once, albeit with questionable likelihood (details in the bug).

So, it's not that what you're suggesting was never tried. But, from the above, you can clearly see what kind of convoluted methods you need to resort to.

One concrete point from the above: it'd be very useful to have an env variable that can be toggled to run libvirt/QEMU under `gdb` for $REVIEW. (Sure, it's a patch that needs to be worked on. . .)

[. . .]

> The QA team tries very hard to make our integration testing environment mimic real-world deployment configurations as closely as possible. If these sorts of bugs emerge more often because of, for example, resource constraints in the test environment, then it should be entirely likely they'd also be seen in production with the same frequency if run on similarly constrained equipment. And as we've observed in the past, any code path we stop testing quickly accumulates new bugs that go unnoticed until they impact someone's production environment at 3am.

I realize you're raising the point that this should not be taken lightly -- I hope the context provided in this email demonstrates that it isn't being taken lightly.

PS: FWIW, I do enable this codepath in my test environments (sure, it's not *representative*), but I'm yet to reproduce the bug.

--
/kashyap

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
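The "env variable that toggles running QEMU under `gdb`" idea from the message above might look something like the sketch below. Everything here is an assumption for illustration: the variable name `QEMU_UNDER_GDB` is invented, nothing in devstack or nova defines it, and in a real deployment libvirt (not nova) launches QEMU, so a wrapper like this would have to be wired in as the emulator binary. The gdb options used (`-batch`, `-ex`, `--args`) are standard gdb command-line options.

```python
import os

# Hypothetical toggle name; not defined by devstack, nova, or libvirt.
GDB_TOGGLE = "QEMU_UNDER_GDB"


def qemu_command(qemu_argv, environ=None):
    """Return the argv to launch, wrapped in gdb when the toggle is set."""
    environ = os.environ if environ is None else environ
    if environ.get(GDB_TOGGLE, "").lower() not in ("1", "true", "yes"):
        # Toggle unset: launch QEMU exactly as given.
        return list(qemu_argv)
    # -batch: run the -ex commands non-interactively, then exit gdb.
    # "run" starts QEMU; if it stops on a signal (e.g. a crash), the
    # second command dumps a backtrace for every thread.
    return [
        "gdb", "-batch",
        "-ex", "run",
        "-ex", "thread apply all bt",
        "--args",
    ] + list(qemu_argv)
```

Note that this only yields a backtrace when QEMU stops on its own. For a hang or timeout like the one in the bug, someone would still need to interrupt the process or attach to it, so this is a starting point rather than a complete debugging setup.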
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On Dec 8, 2014, at 13:12, Jeremy Stanley wrote:

> I'm dubious of this as it basically says "we know this breaks sometimes, so we're going to stop testing that it works at all and possibly let it get even more broken, but you should be safe to rely on it anyway."

+1, it seems bad to enable something everywhere *except* the gate. I prefer the original suggestion to include a config option that is disabled by default and that a user can enable if they want. From what I understand, the feature works "most of the time" and I don't see why a user is guaranteed not to encounter the same conditions that happen in the gate. For that reason I think it makes sense for this to be an experimental, opt-in-by-config feature.

melanie (melwitt)
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> As Dan Berrangé noted, it's nearly impossible to reproduce this issue independently outside of the OpenStack gating environment. I brought this up at the recently concluded KVM Forum earlier this October. To debug this any further, one of the QEMU block layer developers asked if we can get the QEMU instance on a Gate run running under `gdb` (IIRC, danpb suggested this too, previously) to get further tracing details.

We document thoroughly how to reproduce the environments we use for testing OpenStack. There's nothing rarified about "a Gate run" that anyone with access to a public cloud provider would be unable to reproduce, save being able to run it over and over enough times to expose less frequent failures.

> FWIW, I myself couldn't reproduce it independently via libvirt alone or via QMP (QEMU Machine Protocol) commands.
>
> Dan's workaround ("enable it permanently, except for under the gate") sounds sensible to me.
[...]

I'm dubious of this as it basically says "we know this breaks sometimes, so we're going to stop testing that it works at all and possibly let it get even more broken, but you should be safe to rely on it anyway."

The QA team tries very hard to make our integration testing environment mimic real-world deployment configurations as closely as possible. If these sorts of bugs emerge more often because of, for example, resource constraints in the test environment, then it should be entirely likely they'd also be seen in production with the same frequency if run on similarly constrained equipment. And as we've observed in the past, any code path we stop testing quickly accumulates new bugs that go unnoticed until they impact someone's production environment at 3am.

--
Jeremy Stanley
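The caveat about running "over and over enough times to expose less frequent failures" can be made concrete with a little arithmetic. As a rough sketch, assuming each run fails independently with the same probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The function name and the 95% target below are illustrative choices, and the 25% figure is the gate failure rate quoted elsewhere in this thread; the rate outside the gate is unknown and apparently much lower.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes runs are independent and share one failure probability,
    which is a simplification for load-dependent bugs.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# At the gate's reported ~25% failure rate, 11 runs give a 95% chance of
# hitting the bug at least once; at a 1% rate it takes 299 runs.
print(runs_needed(0.25))
print(runs_needed(0.01))
```

This is why a bug that shows up regularly in the gate can still look unreproducible on a quieter machine: if the local failure rate is a small fraction of the gate's, a handful of manual attempts is unlikely to trigger it.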
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On Fri, Dec 05, 2014 at 02:12:44PM -0600, Matt Riedemann wrote:
> On 12/5/2014 1:32 PM, Sean Dague wrote:
> >On 12/05/2014 01:50 PM, Matt Riedemann wrote:
> >>In Juno we effectively disabled live snapshots with libvirt due to bug 1334398 [1] failing the gate about 25% of the time.
> >>
> >>I was going through the Juno release notes today and saw this as a known issue, which reminded me of it, and I was wondering if there is anything being done about it?

As Dan Berrangé noted, it's nearly impossible to reproduce this issue independently outside of the OpenStack gating environment. I brought this up at the recently concluded KVM Forum earlier this October. To debug this any further, one of the QEMU block layer developers asked if we can get the QEMU instance on a Gate run running under `gdb` (IIRC, danpb suggested this too, previously) to get further tracing details.

> >>As I recall, it *works* but it wasn't working under the stress our check/gate system puts on that code path.

FWIW, I myself couldn't reproduce it independently via libvirt alone or via QMP (QEMU Machine Protocol) commands.

Dan's workaround ("enable it permanently, except for under the gate") sounds sensible to me.

> >>One thing I'm thinking is, couldn't we make this an experimental config option? By default it would be disabled, but we could run it in the experimental queue, and people could use it without having to patch the code to remove the artificial minimum version constraint.
> >>
> >>Something like:
> >>
> >>if CONF.libvirt.live_snapshot_supported:
> >>    # do your thing
> >>
> >>[1] https://bugs.launchpad.net/nova/+bug/1334398
> >
> >So, it works. If you aren't booting / shutting down guests at exactly the same time as snapshotting.

I tried this exact case independently and cannot reproduce it, as stated by Dan (and others on the bug) in this thread.

> >I believe cburgess said in IRC yesterday he was going to take another look at it next week.
> >
> >I'm happy to put this into dansmith's patented [workarounds] config group (coming soon to fix the qemu-convert bug). But I don't think this should be a normal libvirt option.
> >
> > -Sean
>
> Yeah the [workarounds] group

Is there any URL where I can read about this more?

> is what got me thinking about it too as a config option, otherwise I think the idea of an [experimental] config group has come up before as a place to put 'not tested, here be dragons' type stuff.

--
/kashyap
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On Fri, Dec 05, 2014 at 12:50:37PM -0600, Matt Riedemann wrote:
> In Juno we effectively disabled live snapshots with libvirt due to bug 1334398 [1] failing the gate about 25% of the time.
>
> I was going through the Juno release notes today and saw this as a known issue, which reminded me of it, and I was wondering if there is anything being done about it?
>
> As I recall, it *works* but it wasn't working under the stress our check/gate system puts on that code path.

Yep, I've tried to reproduce the problem in countless different ways and never succeeded, even when replicating the gate test VM config & setup exactly. IOW it is a highly load-dependent edge case. IMHO we did a disservice to users by disabling this. Based on my experience trying to reproduce it, it is something that would work fine for end users times out of 1. I think we should just put a temporary hack into Nova that only disables the code when running under the gate systems, leaving it enabled for users.

> One thing I'm thinking is, couldn't we make this an experimental config option? By default it would be disabled, but we could run it in the experimental queue, and people could use it without having to patch the code to remove the artificial minimum version constraint.
>
> Something like:
>
> if CONF.libvirt.live_snapshot_supported:
>     # do your thing

I don't really think we need that. Just enable it permanently, except for under the gate.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On 12/5/2014 1:32 PM, Sean Dague wrote:
> On 12/05/2014 01:50 PM, Matt Riedemann wrote:
>> In Juno we effectively disabled live snapshots with libvirt due to bug 1334398 [1] failing the gate about 25% of the time.
>>
>> I was going through the Juno release notes today and saw this as a known issue, which reminded me of it, and I was wondering if there is anything being done about it?
>>
>> As I recall, it *works* but it wasn't working under the stress our check/gate system puts on that code path.
>>
>> One thing I'm thinking is, couldn't we make this an experimental config option? By default it would be disabled, but we could run it in the experimental queue, and people could use it without having to patch the code to remove the artificial minimum version constraint.
>>
>> Something like:
>>
>> if CONF.libvirt.live_snapshot_supported:
>>     # do your thing
>>
>> [1] https://bugs.launchpad.net/nova/+bug/1334398
>
> So, it works. If you aren't booting / shutting down guests at exactly the same time as snapshotting. I believe cburgess said in IRC yesterday he was going to take another look at it next week.
>
> I'm happy to put this into dansmith's patented [workarounds] config group (coming soon to fix the qemu-convert bug). But I don't think this should be a normal libvirt option.
>
> -Sean

Yeah the [workarounds] group is what got me thinking about it too as a config option, otherwise I think the idea of an [experimental] config group has come up before as a place to put 'not tested, here be dragons' type stuff.

--
Thanks,
Matt Riedemann
Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
On 12/05/2014 01:50 PM, Matt Riedemann wrote:
> In Juno we effectively disabled live snapshots with libvirt due to bug 1334398 [1] failing the gate about 25% of the time.
>
> I was going through the Juno release notes today and saw this as a known issue, which reminded me of it, and I was wondering if there is anything being done about it?
>
> As I recall, it *works* but it wasn't working under the stress our check/gate system puts on that code path.
>
> One thing I'm thinking is, couldn't we make this an experimental config option? By default it would be disabled, but we could run it in the experimental queue, and people could use it without having to patch the code to remove the artificial minimum version constraint.
>
> Something like:
>
> if CONF.libvirt.live_snapshot_supported:
>     # do your thing
>
> [1] https://bugs.launchpad.net/nova/+bug/1334398

So, it works. If you aren't booting / shutting down guests at exactly the same time as snapshotting. I believe cburgess said in IRC yesterday he was going to take another look at it next week.

I'm happy to put this into dansmith's patented [workarounds] config group (coming soon to fix the qemu-convert bug). But I don't think this should be a normal libvirt option.

-Sean

--
Sean Dague
http://dague.net